Author: Chow Chun Mun
Practical Examination in Actuarial Data Science Immersion, 13 April – 13 May 2025
Part II (Python)
Diabetes Hospital Readmission: Focus on Neural Networks and Embeddings
IMPORTANT NOTES ON COMPLETING THIS PART OF THE TASK
The second part of the exam is to be implemented in Python. An existing Jupyter notebook, which has been deemed suitable, should be used as a template and adapted to different data and modified questions.
The notebook for the use case “Forecasting Rare Events: Credit Scoring” of DAV (see https://aktuar.de/en/knowledge/specialist-information/detail/forecasting-rare-events-credit-scoring/) serves as
a template for the tasks PT1, PT2, PT3, and PT7 b). It must be modified according to the task requirements
(i.e., texts, code, and notebook cells should be appropriately modified, added, removed, or commented out).
The notebook to be used as a template is in an amended and slightly edited version named
template2-credit-scoring-english.ipynb in the examination materials and will hereafter be referred to
by the abbreviation CSN. Remark: This notebook.
Tasks PT0 and PT4 to PT7 a) cover new topics that have not been addressed in the notebook template. These tasks require the addition of new notebook cells, and the notebook must be expanded accordingly.
The datasets diabetic_data_bin.csv and icd9_data.csv, generated in Task R1 of Part I, are required for
further processing.
When making revisions, ensure that section numbers from the original notebook CSN remain unchanged (in the first three tasks PT1 to PT3). The corresponding headings can be adjusted as needed. The texts taken from the template must be adapted to reflect the modified context (e.g., data and results). Unnecessary text can be removed.
Task PT0: Setting Up a Development Environment [Learning objectives: 5.1; Points: 1]
For setting up the development environment, Python 3.10.16 is recommended. The examination materials
include a requirements.txt file, which contains the necessary packages. Based on this file, the required
packages can be installed. The Python version used must be displayed. Additionally, the installed packages
must be listed clearly (five packages with version numbers per line).
Solution:
First of all, we create a Python environment and install the required dependencies.
The following description assumes that Python (version 3.13.3) along with Anaconda has already been installed on your computer. Now, to set up a Python environment with Python 3.10.16 for the following code and install dependencies via the command line, proceed as follows:
- Open a terminal and navigate to the directory where your project is located.
- Use the following command to create a virtual environment with Python 3.10.16:
conda create --name cads3 python=3.10.16
- Activate the virtual environment with the corresponding command:
conda activate cads3
- Install the dependencies from the requirements.txt file given:
pip install -r requirements.txt
Display the Python version:
!python --version
Python 3.10.16
Display the list of installed packages:
# Import the package `subprocess` to run shell commands and capture their output
import subprocess
# Run the pip list command to get the installed packages in the format `package==version` with --format=freeze
result = subprocess.run(["pip", "list", "--format=freeze"],
capture_output=True,
text=True) # Return output as a string (instead of bytes)
# Split the output by lines and store it in a list
package_list = result.stdout.splitlines()
# Calculate the maximum length of the package-version strings for alignment of the output
max_package_length = max(len(pkg) for pkg in package_list)
# Print the packages in a nicely left-aligned format (5 per line)
print("List of installed packages:\n")
for i in range(0, len(package_list), 5):
    line = ""
    for pkg in package_list[i:i + 5]:
        # Align package-version pairs in columns
        line += f"{pkg.ljust(max_package_length)}" + "\t"
    print(line)
List of installed packages:

absl-py==2.1.0  asttokens==3.0.0  astunparse==1.6.3  catboost==1.2.7  certifi==2024.12.14
charset-normalizer==3.4.0  colorama==0.4.6  comm==0.2.2  contourpy==1.3.1  cycler==0.12.1
debugpy==1.8.11  decorator==5.1.1  exceptiongroup==1.2.2  executing==2.1.0  flatbuffers==24.3.25
fonttools==4.55.3  gast==0.6.0  google-pasta==0.2.0  graphviz==0.20.3  grpcio==1.68.1
h5py==3.12.1  idna==3.10  importlib_metadata==8.5.0  ipykernel==6.29.5  ipython==8.31.0
jedi==0.19.2  Jinja2==3.1.5  joblib==1.4.2  jupyter_client==8.6.3  jupyter_core==5.7.2
keras==3.7.0  kiwisolver==1.4.7  libclang==18.1.1  lightgbm==4.5.0  Markdown==3.7
markdown-it-py==3.0.0  MarkupSafe==3.0.2  matplotlib==3.10.0  matplotlib-inline==0.1.7  mdurl==0.1.2
ml-dtypes==0.4.1  namex==0.0.8  nest-asyncio==1.6.0  numpy==1.26.4  opt_einsum==3.4.0
optree==0.13.1  packaging==24.2  pandas==2.2.3  parso==0.8.4  patsy==1.0.1
pexpect==4.9.0  pickleshare==0.7.5  pillow==11.0.0  pip==24.3.1  platformdirs==4.3.6
plotly==5.24.1  prompt_toolkit==3.0.48  protobuf==5.29.2  psutil==6.1.0  ptyprocess==0.7.0
pure_eval==0.2.3  Pygments==2.18.0  pyparsing==3.2.0  python-dateutil==2.9.0.post0  pytz==2024.2
pywin32==310  pyzmq==26.2.0  requests==2.32.3  rich==13.9.4  scikit-learn==1.5.2
scipy==1.14.1  seaborn==0.13.2  setuptools==75.6.0  six==1.17.0  stack-data==0.6.3
statsmodels==0.14.4  tenacity==9.0.0  tensorboard==2.18.0  tensorboard-data-server==0.7.2  tensorflow==2.18.0
tensorflow_intel==2.18.0  tensorflow-io-gcs-filesystem==0.31.0  termcolor==2.5.0  threadpoolctl==3.5.0  tornado==6.4.2
traitlets==5.14.3  typing_extensions==4.12.2  tzdata==2024.2  urllib3==2.3.0  wcwidth==0.2.13
Werkzeug==3.1.3  wheel==0.45.1  wrapt==1.17.0  xgboost==2.1.3  zipp==3.21.0
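As a hedged alternative to the shell-based cells above, the same information can be read from the interpreter that actually runs the notebook kernel, which avoids the case where `python` or `pip` on the PATH belong to a different environment. The helper names `installed_packages` and `format_rows` are our own illustrative choices, not part of the template; this is a minimal sketch using only the standard library.

```python
import sys
from importlib.metadata import distributions


def installed_packages():
    """Return sorted 'name==version' strings for the running interpreter."""
    entries = set()
    for dist in distributions():
        # Skip broken installations that expose no metadata or no name
        name = dist.metadata.get("Name") if dist.metadata else None
        if name:
            entries.add(f"{name}=={dist.version}")
    return sorted(entries, key=str.lower)


def format_rows(items, per_line=5):
    """Left-align the items in columns and group them, `per_line` items per row."""
    width = max((len(item) for item in items), default=0)
    return [
        "  ".join(item.ljust(width) for item in items[i:i + per_line]).rstrip()
        for i in range(0, len(items), per_line)
    ]


# Version of the interpreter executing this cell
print("Python", sys.version.split()[0])
print("List of installed packages:\n")
for row in format_rows(installed_packages()):
    print(row)
```

The output should agree with `pip list --format=freeze` whenever `pip` and the kernel point to the same environment.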
Task PT1: Creating a Good Benchmark Model as Simply as Possible [Learning objectives: 5.1; Points: 4]
Based on the CSN notebook (Part A), the following adjustments must be made:
- Remove all cells before the text “Part A: Quick & Easy” that relate to the original topic of credit scoring.
- Within "Part A: Quick & Easy":
- Include all packages used in the notebook in Section 1.1, and remove any unnecessary imports. Use a RANDOM SEED of 42.
- Replace the dataset "../input/home-credit-default-risk/application_train.csv" used in Section 1.2 with the dataset diabetic_data_bin.csv created in Task R1 of Part I. What modifications are necessary to correctly load the dataset?
- Set the data type of every feature read in that has a name ending with “_id” to “object.” What advantage does this approach offer?
- For Part A, no additional modifications to the dataset are required beyond those already present in the notebook. If certain code blocks are not needed, comment them out with an appropriate note. Reminder: Keep in mind here and throughout the rest of the notebook the general guideline that texts must be
adapted to the dataset.
Solution:
In Part A, we establish a baseline for model performance using a straightforward yet effective CatBoost algorithm that provides a quick insight into the predictive power of our dataset. As we progress through Parts B and C, we will explore more advanced models and optimization methods to enhance our predictions and better understand hospital readmission behavior for diabetic patients.
In this very first section, we will quickly build a basic machine learning model for the binary classification task of predicting hospital readmission within 30 days. This baseline model will serve as a benchmark for all subsequent machine learning models.
1.1 Setting Up the Modeling Environment
We begin by importing all required libraries, calibrating visualization parameters, and setting a random seed to ensure the reproducibility of our results.
import os
# Set environment variable to ignore all warnings
os.environ['PYTHONWARNINGS'] = 'ignore'
# Standard Python libraries for data analysis, scientific computing, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import re
import random
from scipy.stats import uniform, loguniform
# Setting display options for pandas and matplotlib
pd.set_option('display.max_columns', None)
COLOR_LIGHT, COLOR_DARK = '#849CBE', '#00548A'
# Scikit-learn modules for preprocessing and machine learning models
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
# Gradient boosting frameworks and logistic regression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import statsmodels.formula.api as smf
# Tensorflow and Keras for artificial neural networks
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Embedding, Flatten, Concatenate, Dropout
from tensorflow.keras.utils import set_random_seed
# Ensure reproducibility
RANDOM_SEED = 42
set_random_seed(RANDOM_SEED)
The template adjusts TensorFlow's GPU memory settings at this point to optimize resource usage throughout the modeling workflow.
However, upon reviewing the packages specified in requirements.txt, we decide not to use a GPU for the time being and therefore comment out the relevant code.
# # Prevent TensorFlow From Fully Allocating GPU Memory
# # Ref: https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
# gpus = tf.config.list_physical_devices('GPU')
# if gpus:
#     try:
#         # Currently, memory growth needs to be the same across GPUs
#         for gpu in gpus:
#             tf.config.experimental.set_memory_growth(gpu, True)
#         logical_gpus = tf.config.list_logical_devices('GPU')
#         print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
#     except RuntimeError as e:
#         # Memory growth must be set before GPUs have been initialized
#         print(e)
1.2 Load and examine the input data
Next, we load the dataset diabetic_data_bin.csv. To prevent pandas from interpreting strings such as "None" as missing values, and since the dataset generated in Part I contains no NA values, we disable automatic NA detection via na_filter=False.
Then, we display its dimensions by printing the number of rows and columns. We also present several randomly selected rows to illustrate its content and its data structure.
# Load the dataset into a pandas DataFrame without interpreting any strings as NaN
df_raw = pd.read_csv("../02_tidy_data/diabetic_data_bin.csv", # UPDATE this path if needed
na_filter=False) # Ensures all values are read as-is (no auto-NaN)
# Display the dimensions of the dataset (rows, columns)
print(f"Input Dataset dimensions (rows, columns): {df_raw.shape}")
# Display a random sample of 7 rows from the dataset to inspect data
df_raw.sample(n=7, random_state=RANDOM_SEED)
Input Dataset dimensions (rows, columns): (47751, 51)
| | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | TARGET | group_diag_1 | group_diag_2 | group_diag_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47523 | Asian | Female | [50-60) | ? | 3 | 6 | 1 | 1 | ? | Orthopedics | 18 | 4 | 6 | 0 | 0 | 0 | 722 | 250 | 401 | 4 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 0 | Musculoskeletal | Diabetes | Circulatory |
| 8582 | Caucasian | Male | [40-50) | ? | 2 | 6 | 4 | 8 | ? | ? | 45 | 6 | 53 | 0 | 0 | 0 | 410 | 414 | 401 | 7 | None | None | No | No | No | No | No | No | No | Up | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Ch | Yes | 0 | Circulatory | Circulatory | Circulatory |
| 42024 | Caucasian | Female | [70-80) | ? | 1 | 3 | 7 | 3 | MC | ? | 47 | 0 | 13 | 0 | 0 | 0 | 25012 | 427 | 276 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 0 | Diabetes | Circulatory | Other |
| 15219 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ? | InternalMedicine | 35 | 2 | 7 | 0 | 0 | 0 | 25013 | 280 | 789 | 5 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | Yes | 0 | Diabetes | Other | Neoplasms |
| 45649 | Caucasian | Female | [60-70) | ? | 1 | 2 | 7 | 1 | ? | ? | 23 | 0 | 11 | 0 | 0 | 0 | 410 | 414 | 250 | 5 | None | >8 | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | 1 | Circulatory | Circulatory | Diabetes |
| 23448 | Caucasian | Male | [80-90) | ? | 3 | 28 | 1 | 1 | SP | ? | 55 | 0 | 13 | 0 | 0 | 0 | 294 | 312 | 414 | 5 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 1 | Other | Other | Circulatory |
| 39945 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 4 | ? | ? | 56 | 2 | 16 | 1 | 0 | 0 | 535 | 584 | 25041 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 1 | Other | Genitourinary | Diabetes |
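To illustrate the effect of na_filter=False used when loading the data, the following minimal sketch reads a tiny in-memory CSV twice. The column names and values are made up for illustration. The example uses the string "NA", which pandas treats as missing by default; to our knowledge, recent pandas versions handle the string "None" in the same way, which is the case relevant to columns such as max_glu_serum.

```python
import io

import pandas as pd

# Hypothetical three-row extract; "NA" stands in for a legitimate category label
csv_text = "patient,max_glu_serum\n1,NA\n2,Norm\n3,>200\n"

# Default behaviour: strings on pandas' NA list are converted to NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# With na_filter=False every value is kept exactly as written in the file
asis_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)

n_missing_default = int(default_df["max_glu_serum"].isna().sum())
n_missing_asis = int(asis_df["max_glu_serum"].isna().sum())

print("Missing values with default settings:", n_missing_default)
print("Missing values with na_filter=False: ", n_missing_asis)
```

Disabling the filter is only safe when the file is known to contain no genuine missing values, as is the case for the dataset from Part I.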
Now, we set the data type of every feature with a name ending in "_id" to object.
The advantage of setting "_id" fields as object during data loading is that IDs are typically used as identifiers. If kept as integers or floats, they might be misinterpreted during processing (e.g., statistics, scaling, etc.) and modeling. Additionally, IDs like "00123", if read as numbers, would become "123", losing their leading zeros. Using object ensures they are treated as categorical or string-like, preserving their meaning and formatting while preventing unintended calculations.
# Find columns whose names end with "_id"
id_columns = [col for col in df_raw.columns if col.endswith("_id")]
# Convert their types to 'object'
df_raw[id_columns] = df_raw[id_columns].astype('object')
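The leading-zero argument can be checked with a small sketch. The column name payer_id and its values are hypothetical and do not occur in our dataset. The sketch also shows a caveat: converting to object after loading changes only the dtype, so zeros that were already dropped during parsing are not restored, and the stored values remain integers. For our IDs, which carry no leading zeros, this makes no difference.

```python
import io

import pandas as pd

# Hypothetical ID column with leading zeros
csv_text = "payer_id,amount\n00123,10\n00045,20\n"

# Parsed as numbers: the leading zeros are lost
as_number = pd.read_csv(io.StringIO(csv_text))

# Parsed as text at load time: the original formatting is preserved
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"payer_id": str})

# Converting after loading switches the dtype but keeps the integer values
converted_late = as_number["payer_id"].astype("object")

print(as_number["payer_id"].tolist())
print(as_text["payer_id"].tolist())
print(converted_late.tolist(), converted_late.dtype)
```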
We now output the data type of each column of our dataset to check that they are set correctly.
# Format column names and dtypes
dtypes_list = [f"{col:<25}: {str(dtype):<10}" for col, dtype in df_raw.dtypes.items()]
# Print the column types in a nicely left-aligned format (5 per line)
print("List of data features and their types:\n")
for i in range(0, len(dtypes_list), 5):
print(" ".join(dtypes_list[i:i+5]))
List of data features and their types:

race: object  gender: object  age: object  weight: object  admission_type_id: object
discharge_disposition_id: object  admission_source_id: object  time_in_hospital: int64  payer_code: object  medical_specialty: object
num_lab_procedures: int64  num_procedures: int64  num_medications: int64  number_outpatient: int64  number_emergency: int64
number_inpatient: int64  diag_1: object  diag_2: object  diag_3: object  number_diagnoses: int64
max_glu_serum: object  A1Cresult: object  metformin: object  repaglinide: object  nateglinide: object
chlorpropamide: object  glimepiride: object  acetohexamide: object  glipizide: object  glyburide: object
tolbutamide: object  pioglitazone: object  rosiglitazone: object  acarbose: object  miglitol: object
troglitazone: object  tolazamide: object  examide: object  citoglipton: object  insulin: object
glyburide-metformin: object  glipizide-metformin: object  glimepiride-pioglitazone: object  metformin-rosiglitazone: object  metformin-pioglitazone: object
change: object  diabetesMed: object  TARGET: int64  group_diag_1: object  group_diag_2: object
group_diag_3: object
We see that our features are of the correct data type as desired.
Note that the IDs in our case are formatted as, e.g., 3 instead of 03. Rather than specifying the dtype parameter at load time to treat the ID columns as non-numeric, we set the data type after loading to preserve the order, particularly for later visualization.
Moreover, in contrast to R, we don't convert our binary target variable (0, 1) to type object in Python. Most machine learning libraries like scikit-learn, XGBoost, or LightGBM expect numerical inputs for classification, and a binary target with values 0 and 1 is perfect for binary classification.
Subsequently, we will assess the balance of the dataset by plotting the distribution of the 'TARGET' variable.
# Count the occurrences of each unique value in the 'TARGET' column
target_counts = df_raw['TARGET'].value_counts()
# Calculate the total number of entries
total_entries = len(df_raw['TARGET'])
# Define a custom formatting function for the pie chart labels
def custom_autopct(pct):
    # Round rather than truncate so that floating-point error cannot drop one entry
    absolute = int(round(pct / 100. * total_entries))
    return "{:.2f}%\n({:d} entries)".format(pct, absolute)
# Create a pie chart to visualize the distribution of classes in the 'TARGET' column
pie_chart, _, autotexts = plt.pie(target_counts, labels=target_counts.index.map({0: 'Class 0', 1: 'Class 1'}),
                                  autopct=custom_autopct, startangle=90,
                                  colors=[COLOR_LIGHT, COLOR_DARK], wedgeprops={'edgecolor': 'none'})
# Set the color of the autopct texts to white
for autotext in autotexts:
    autotext.set_color('white')
# Plot the pie chart
plt.title('Distribution of the TARGET Variable')
plt.show()
The pie chart illustrates a clear imbalance in the distribution of the dependent variable TARGET: The critical class 1 appears significantly less frequently than class 0.
Accurate classification of rare events is challenging in imbalanced datasets, such as the one observed with the TARGET variable. In these scenarios, it is advisable to use the area under the receiver operating characteristic curve (AUC) rather than accuracy as the performance metric. While accuracy reflects overall prediction correctness, it can be misleading by not accounting adequately for the performance on minority classes. In contrast, AUC evaluates a model's ranking ability, which is particularly relevant in distinguishing between classes in imbalanced contexts. AUC measures the trade-off between the true positive rate and the false positive rate, providing a more nuanced understanding of a model's capacity to identify rare events. We will look at this in more detail in Part B.
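The difference between accuracy and AUC on imbalanced data can be made concrete with a toy example. The sketch below uses made-up labels with roughly the class balance observed above (13 positives out of 100) and two hypothetical sets of scores. The AUC is computed directly from its ranking interpretation, i.e., the probability that a randomly chosen positive receives a higher score than a randomly chosen negative; the helper functions are our own and only serve the illustration.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of correct class predictions when scores are cut at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy data: 87 non-events followed by 13 events
y_true = [0] * 87 + [1] * 13

# Model A: the same low score for everyone (no ranking ability at all)
constant_scores = [0.1] * 100
# Model B: all scores below 0.5, but events always score higher than non-events
ranked_scores = [0.2] * 87 + [0.4] * 13

acc_constant = accuracy(y_true, constant_scores)
auc_constant = auc_from_scores(y_true, constant_scores)
acc_ranked = accuracy(y_true, ranked_scores)
auc_ranked = auc_from_scores(y_true, ranked_scores)

print(f"Model A: accuracy = {acc_constant:.2f}, AUC = {auc_constant:.2f}")
print(f"Model B: accuracy = {acc_ranked:.2f}, AUC = {auc_ranked:.2f}")
```

Both models reach the same accuracy because neither ever predicts class 1 at the 0.5 threshold, yet only Model B separates the classes, which the AUC reflects.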
1.3 Detecting and replacing missing categorical values
While many machine learning (ML) tools require the more complex data preparation steps described in Part B, CatBoost, the ML tool of our choice, only requires the minimal data preparation implemented below by replacing missing categorical values with a predefined value.
In our case, where we know that our dataset has no NA values, this section just serves as a sanity check.
# Identify categorical features by data type 'object'
categorical_features = list(df_raw.select_dtypes(include=['object']).columns)
# Compute the number of unique values for each categorical feature and reset the index to make 'Feature Names' a column
categorical_feature_counts = df_raw[categorical_features].nunique().reset_index()
# Set the column names to 'Categorical Feature' and 'Number of Unique Values'
categorical_feature_counts.columns = ['Categorical Feature', 'Number of Unique Values']
# Sort the DataFrame by 'Number of Unique Values' in descending order
categorical_feature_counts.sort_values(by='Number of Unique Values', ascending=False, inplace=True)
# Display the resulting DataFrame
display(categorical_feature_counts.style.hide(axis='index'))
| Categorical Feature | Number of Unique Values |
|---|---|
| diag_3 | 724 |
| diag_2 | 678 |
| diag_1 | 661 |
| medical_specialty | 45 |
| payer_code | 17 |
| discharge_disposition_id | 15 |
| admission_source_id | 11 |
| group_diag_3 | 10 |
| group_diag_1 | 10 |
| age | 10 |
| group_diag_2 | 10 |
| weight | 8 |
| admission_type_id | 7 |
| race | 6 |
| metformin | 4 |
| pioglitazone | 4 |
| max_glu_serum | 4 |
| A1Cresult | 4 |
| insulin | 4 |
| repaglinide | 4 |
| rosiglitazone | 4 |
| glyburide | 4 |
| glimepiride | 4 |
| glipizide | 4 |
| acarbose | 3 |
| nateglinide | 3 |
| tolbutamide | 2 |
| chlorpropamide | 2 |
| diabetesMed | 2 |
| change | 2 |
| gender | 2 |
| glyburide-metformin | 2 |
| miglitol | 2 |
| tolazamide | 2 |
| citoglipton | 1 |
| glipizide-metformin | 1 |
| glimepiride-pioglitazone | 1 |
| metformin-rosiglitazone | 1 |
| metformin-pioglitazone | 1 |
| examide | 1 |
| acetohexamide | 1 |
| troglitazone | 1 |
Next, we identify those categorical features that contain missing values and calculate the percentage of values that are absent.
# Define a function that returns a DataFrame with features containing missing values
# and the percentage of missing values for each of these features
def missing_values_percentage(data):
    # Calculate percentages and create a DataFrame
    nan_percentages = data.isna().sum() * 100 / len(data)
    missing_values_df = pd.DataFrame({
        'Categorical Feature': nan_percentages.index,
        'Missing Percent': nan_percentages.values
    })
    # Filter out features without missing values and sort numerically by percentage
    # (sorting the formatted strings instead would order them lexicographically)
    missing_values_df = missing_values_df[missing_values_df['Missing Percent'] > 0]
    missing_values_df = missing_values_df.sort_values(by='Missing Percent', ascending=False)
    # Format the 'Missing Percent' column as a percentage string with two decimal places
    missing_values_df['Missing Percent'] = missing_values_df['Missing Percent'].apply(lambda x: f'{x:.2f}%')
    return missing_values_df
# Calculate the proportion of missing values for the categorical features
missing_values_df = missing_values_percentage(df_raw[categorical_features])
# Display the resulting DataFrame without the index using the new recommended Styler.hide method
missing_values_df_styled = missing_values_df.style.hide(axis='index')
display(missing_values_df_styled)
| Categorical Feature | Missing Percent |
|---|---|
To address missing values in categorical columns without imputing data or dropping records, we will assign instances of missing data to a distinct category, __missingValue__. This strategy allows our machine learning algorithms to process missingness as an informative feature in its own right, thereby maintaining the dataset's full breadth and the integrity of input for modeling.
In this case, the empty result confirms that our dataset has no "real" missing values (NA), as expected. As noted earlier, missing values were already handled during preprocessing by assigning them to a distinct category, ?. Hence, we include the following code only for the sake of completeness and comment it out.
# # Replace missing values with a new category '__missingValue__' in all categorical columns
# df_raw[categorical_features] = df_raw[categorical_features].fillna('__missingValue__')
1.4 Partitioning data into training, validation, and test subsets
When we train a machine learning classifier, we don't just want it to learn to model the training data, or simply memorize it (ML models can be large and complex enough to do this). We want the model to generalize to data it hasn't seen before. Therefore, model performance is usually measured against a held-out test set consisting of examples that have never been seen before. For consistency with upcoming model optimization strategies, we randomly divide our dataset into three parts:
- Training set: 70% of examples used for model training
- Validation set: 15% to validate our models after the training and possibly decide on changes
- Test set: 15% for the final measurement of the generalization performance of all relevant models (see Section 11)
This division allows for effective model training, hyperparameter tuning to prevent overfitting, and unbiased evaluation of performance on new, unseen data.
Given the original code block, no additional modifications are required apart from dropping only the column TARGET in the feature set X_raw.
# Constants for data split ratios
TRAIN_RATIO = 0.7
VAL_TEST_RATIO = 0.5 # Splitting the remaining 30% equally into validation and test
# Separate the features (X) and the target (y)
X_raw = df_raw.drop(columns=['TARGET'], axis=1)
y = df_raw['TARGET']
# Split the dataset into a training set and a combined validation and test set
X_raw_train, X_raw_val_test, y_train, y_val_test = train_test_split(X_raw, y, train_size=TRAIN_RATIO, random_state=RANDOM_SEED)
# Further split the combined validation and test set into separate validation and test sets
X_raw_val, X_raw_test, y_val, y_test = train_test_split(X_raw_val_test, y_val_test, test_size=VAL_TEST_RATIO, random_state=RANDOM_SEED)
# Verify the dimensions of the training, validation, and test sets by displaying (rows, columns)
print(f"Training set dimensions (rows, columns): {X_raw_train.shape}")
print(f"Validation set dimensions (rows, columns): {X_raw_val.shape}")
print(f"Test set dimensions (rows, columns): {X_raw_test.shape}")
Training set dimensions (rows, columns): (33425, 50)
Validation set dimensions (rows, columns): (7163, 50)
Test set dimensions (rows, columns): (7163, 50)
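The set sizes printed above can be reproduced with simple arithmetic: 70% of the rows go to training, and the remaining 30% are halved, giving 15% each for validation and test. The sketch below assumes the rounding convention that, to our understanding, train_test_split applies (a fractional train_size is rounded down, a fractional test_size is rounded up); the function name is our own.

```python
import math


def two_stage_split_sizes(n_rows, train_ratio=0.7, val_test_ratio=0.5):
    """Return (train, validation, test) row counts for the two-stage split."""
    # Stage 1: the training share is rounded down, the rest is held out
    n_train = math.floor(train_ratio * n_rows)
    n_rest = n_rows - n_train
    # Stage 2: the test share of the held-out rows is rounded up
    n_test = math.ceil(val_test_ratio * n_rest)
    n_val = n_rest - n_test
    return n_train, n_val, n_test


sizes = two_stage_split_sizes(47751)
print(sizes)
```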
For the three samples, check the mean values and standard deviations of the TARGET variable y:
# Calculate TARGET mean and std for each set
stats = pd.DataFrame({
'Mean': [y_train.mean(), y_val.mean(), y_test.mean()],
'Standard Deviation': [y_train.std(), y_val.std(), y_test.std()]
}, index=['Train', 'Validation', 'Test']).round(4)
print(stats)
              Mean  Standard Deviation
Train       0.1314              0.3378
Validation  0.1312              0.3377
Test        0.1319              0.3384
These randomly generated differences in the mean values of the TARGET variable appear small at first, but can have an impact on the final evaluation, as we will see in Section 11.
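As a side note, for a 0/1 variable the standard deviation carries no information beyond the mean: with event rate p and sample size n, the sample standard deviation equals sqrt(p(1 - p) · n / (n - 1)). The following sketch checks this for the training set, using the rounded mean and the row count reported above as inputs; small deviations in the last digit are therefore possible.

```python
import math


def binary_sample_std(p, n):
    """Sample standard deviation (ddof=1) of a 0/1 variable with mean p and n rows."""
    return math.sqrt(p * (1 - p) * n / (n - 1))


# Rounded training mean and training row count taken from the outputs above
train_std = binary_sample_std(0.1314, 33425)
print(f"Implied training standard deviation: {train_std:.4f}")
```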
1.5 Training a default CatBoost classifier (baseline model)
We proceed by applying the CatBoost model to our prepared dataset, utilizing the default settings for a straightforward evaluation without hyperparameter tuning.
# Start timer to calculate the running time of training the CatBoost model
start_time = time.time()
# Fit the CatBoostClassifier on the training data
CB1 = CatBoostClassifier(eval_metric='AUC', random_seed=RANDOM_SEED)
CB1.fit(X_raw_train, y_train, cat_features=categorical_features, logging_level='Silent')
# Calculate and print the running time in seconds
elapsed_time_CB1 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_CB1:.2f}")
Elapsed time (sec): 157.66
To showcase the breadth of customization available for refining the CatBoost model, we present a complete list of hyperparameters and their default settings.
# Display all model hyperparameters in a DataFrame
hyperparams_list = [(k, v) for k, v in CB1.get_all_params().items()]
hyperparams_df = pd.DataFrame(hyperparams_list, columns=['Hyperparameter', 'Value'])
display(hyperparams_df.style.hide(axis='index'))
| Hyperparameter | Value |
|---|---|
| nan_mode | Min |
| eval_metric | AUC |
| combinations_ctr | ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'] |
| iterations | 1000 |
| sampling_frequency | PerTree |
| fold_permutation_block | 0 |
| leaf_estimation_method | Newton |
| random_score_type | NormalWithModelSizeDecrease |
| counter_calc_method | SkipTest |
| grow_policy | SymmetricTree |
| penalties_coefficient | 1 |
| boosting_type | Plain |
| model_shrink_mode | Constant |
| feature_border_type | GreedyLogSum |
| ctr_leaf_count_limit | 18446744073709551615 |
| bayesian_matrix_reg | 0.100000 |
| one_hot_max_size | 2 |
| eval_fraction | 0 |
| force_unit_auto_pair_weights | False |
| l2_leaf_reg | 3 |
| random_strength | 1 |
| rsm | 1 |
| boost_from_average | False |
| max_ctr_complexity | 4 |
| model_size_reg | 0.500000 |
| simple_ctr | ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'] |
| pool_metainfo_options | {'tags': {}} |
| subsample | 0.800000 |
| use_best_model | False |
| class_names | [0, 1] |
| random_seed | 42 |
| depth | 6 |
| ctr_target_border_count | 1 |
| posterior_sampling | False |
| has_time | False |
| store_all_simple_ctr | False |
| border_count | 254 |
| classes_count | 0 |
| auto_class_weights | None |
| sparse_features_conflict_fraction | 0 |
| leaf_estimation_backtracking | AnyImprovement |
| best_model_min_trees | 1 |
| model_shrink_rate | 0 |
| min_data_in_leaf | 1 |
| loss_function | Logloss |
| learning_rate | 0.046101 |
| score_function | Cosine |
| task_type | CPU |
| leaf_estimation_iterations | 10 |
| bootstrap_type | MVS |
| max_leaves | 64 |
| permutation_count | 4 |
In order to gain preliminary insights into the trained model's behavior, we visualize the built-in feature importances that highlight the most influential features driving the predictions.
# Extract the 15 most important features and create a Series for plotting
feature_importances = pd.Series(CB1.feature_importances_, index=X_raw_train.columns).nlargest(15)
# Plot the feature importances
feature_importances.plot(kind='barh', color=COLOR_DARK)
plt.gca().invert_yaxis()
plt.title('Top 15 Feature Importances in CatBoost Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
The feature importance plot above reveals that the most influential predictors are discharge_disposition_id, diag_1, and number_inpatient. The additional diagnoses, the admission-related variable medical_specialty, the patient's age, and utilization measures such as num_lab_procedures, num_medications, and time_in_hospital also rank among the key features.
We proceed to evaluate the performance of our CatBoost model against the validation set, using the area under the Receiver Operating Characteristic Curve (AUC) as our chosen evaluation metric (see above).
# Calculate the AUC score using the model's prediction probabilities for the positive class
auc_CB1 = roc_auc_score(y_val, CB1.predict_proba(X_raw_val)[:, 1])
# Print the validation AUC
print(f'The validation AUC of the CatBoost model with standard hyperparameters is: {auc_CB1:.6f}')
The validation AUC of the CatBoost model with standard hyperparameters is: 0.707713
In summary, Part A established a solid baseline for our binary classification task utilizing a CatBoost model with minimal data preparation and without adjustment of its hyperparameters. This is already a complex model, but it is very easy to build and is known for its high predictive power with default parameters. This benchmark serves as a reference point to assess the value of the more elaborate models constructed in the subsequent sections.
Task PT2: Logistic Regression and Feature Analysis [Learning objectives: 2.2, 5.1; Points: 10]
Based on the CSN notebook (Part B), the following adjustments must be made:
- a) In Section 2.1, perform only the necessary operations. Which parts can be commented out, and why?
- b) For the logistic regression in Section 2.2, follow this approach: Refer to the feature importance list of CatBoost and construct a model with at least five features. Pay attention to the measurement scale of the
features. Where do problems occur, and what are the possible explanations?
- c) In Section 2.3, only relevant parts need to be addressed. Which parts can be commented out, and why?
- d) The analysis of categorical features in Section 3.1 should be limited to group_diag_1, group_diag_2, group_diag_3, age, medical_specialty, payer_code and discharge_disposition_id.
- e) The analysis of numerical features in Section 3.2 should consider number_outpatient, number_inpatient, num_procedures, num_lab_procedures, num_medications, time_in_hospital, number_emergency and number_diagnoses.
- Section 3.3 as well as Section 4 (Enhancing CatBoost’s Performance with New Features) are not considered at this point.
- Section 5 is not considered.
- f) Execute Section 6:
- In Section 6.3, adjust non_event_factor and provide a justification for this adjustment.
- After Section 6.4, insert a new Section 6.5, in which the results of the models so far are compared.
Consult Section 4.2 of the CSN notebook for guidance.
Solution:
Part B focuses on refining our machine learning strategy to extract valuable insights. We start with a thorough examination of logistic regression, a cornerstone of actuarial analytics, in its classical unregularized form. We then move on to a targeted exploratory data analysis and feature engineering to enhance our models' predictive power. An emphasis on model explainability is also included. To set the stage for the advanced modeling techniques in Part C, we wrap up with key preprocessing activities: encoding, scaling, and subsampling.
In this section, we dive into logistic regression, a classical and foundational technique in binary classification tasks. Since this method cannot deal with missing feature values, we start by imputing missing numerical values by median values. We then implement and evaluate a low-parameter classical logistic regression model. Following this, we refit the CatBoost model based on the modified datasets, compare the new model's performance with that of the benchmark model and interpret the result.
2.1 Preparing the data: Imputing missing numbers with median
In this subsection, we address the challenge of missing data by imputing these missing values, an essential step since logistic regression requires complete datasets to accurately calculate predictions and model parameters. We have already imputed missing values for categorical features with a distinct category in Subsection 1.3. For numerical features, we choose the median due to its robustness to outliers and its effectiveness in preserving the distribution of skewed datasets. Although median imputation may reduce data variability and possibly introduce bias if the missing data is not random, its simplicity offers a practical approach for setting up a baseline model in binary classification tasks. The code snippet below demonstrates how we impute missing values in numerical features using the median calculated from the training data to prevent data leakage.
Since our dataset contains no NA values, we comment out the corresponding code lines below and retain them for potential future use in other scenarios.
# # Identify numerical features
# numerical_features = list(X_raw_train.select_dtypes(exclude=['object']).columns)
# Copy the Input Data
df_pre = df_raw.copy()
# # Replace missing values for numeric features by median value from the training data
# for col in numerical_features:
#     median_value = X_raw_train[col].median()
#     df_pre[col] = df_pre[col].fillna(median_value)
# Check for remaining missing values:
num_missing = df_pre.isnull().sum().sum()
print(f"\nNumber of missing values after imputation: {num_missing}")
Number of missing values after imputation: 0
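As a minimal sketch of the leakage-safe median imputation described above, the toy example below computes the median on the training rows only and applies it to all rows. The frame `toy`, its values, and the assumed train/validation boundary are invented for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical feature with missing values
toy = pd.DataFrame({"num_lab_procedures": [40.0, np.nan, 55.0, 70.0, np.nan, 10.0]})

# Assumption: the first four rows form the training set, the rest the validation set
toy_train = toy.iloc[:4]

# The median is computed from the training rows only (NaN values are skipped)
train_median = toy_train["num_lab_procedures"].median()

# The same training median fills the gaps in all rows, including validation rows
toy_imputed = toy.fillna({"num_lab_procedures": train_median})

print(f"Training median: {train_median}")
print(toy_imputed["num_lab_procedures"].tolist())
```

The validation row with a missing value receives the training median, so no information from the validation data enters the imputation.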
While a CatBoost model has no problems with very rare feature values, such values can prevent the upcoming logistic regression from converging. For this reason, the template replaces the extremely rare unknown-gender values (XNA) with the value “F”.
Since this data cleaning process was completed in Part I, we comment out the following code.
# # Replace very rare values
# print("\nGender_Frequency: ", df_pre.CODE_GENDER.value_counts())
# df_pre.loc[df_pre["CODE_GENDER"] != "M", "CODE_GENDER"] = "F"
As performed previously in Subsection 1.4, we will proceed to divide our preprocessed dataset into separate training, validation, and testing sets accordingly. This crucial step, essential for training, fine-tuning, and evaluating our machine learning models, ensures that our model can be validated and tested on data that it has not seen during the training process.
# Constants for data split ratios
TRAIN_RATIO = 0.7
VAL_TEST_RATIO = 0.5 # Splitting the remaining 30% equally into validation and test
# Separate the features (X) and the target (y) after preprocessing
X_pre = df_pre.drop(['TARGET'], axis=1)
y_pre = df_pre['TARGET']
# Split the preprocessed dataset into a training set and a combined validation and test set
X_pre_train, X_pre_val_test, y_pre_train, y_pre_val_test = train_test_split(X_pre, y_pre, train_size=TRAIN_RATIO, random_state=RANDOM_SEED)
# Further split the combined validation and test set into separate validation and test sets
X_pre_val, X_pre_test, y_pre_val, y_pre_test = train_test_split(X_pre_val_test, y_pre_val_test, test_size=VAL_TEST_RATIO, random_state=RANDOM_SEED)
# Verify the dimensions of the training, validation, and test sets by displaying (rows, columns)
print(f"Preprocessed training set dimensions (rows, columns): {X_pre_train.shape}")
print(f"Preprocessed validation set dimensions (rows, columns): {X_pre_val.shape}")
print(f"Preprocessed test set dimensions (rows, columns): {X_pre_test.shape}")
Preprocessed training set dimensions (rows, columns): (33425, 50)
Preprocessed validation set dimensions (rows, columns): (7163, 50)
Preprocessed test set dimensions (rows, columns): (7163, 50)
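The reported set sizes can be reproduced by hand. The short sketch below assumes that train_test_split rounds a fractional train_size down and a fractional test_size up; the total of 47751 rows is the sum of the three set sizes printed above.

```python
import math

n_samples = 33425 + 7163 + 7163  # total number of rows, i.e. 47751
TRAIN_RATIO = 0.7
VAL_TEST_RATIO = 0.5

# Assumed rounding rule: a fractional train_size is rounded down ...
n_train = math.floor(TRAIN_RATIO * n_samples)
n_val_test = n_samples - n_train

# ... and a fractional test_size is rounded up
n_test = math.ceil(VAL_TEST_RATIO * n_val_test)
n_val = n_val_test - n_test

print(n_train, n_val, n_test)  # 33425 7163 7163
```

This corresponds to an effective split of roughly 70% / 15% / 15%.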
2.2 Logistic regression implementation: Fitting and performance evaluation
Preparing for a logistic regression analysis that echoes R's formula-based modeling style, we merge the features and target variable into a single dataframe. This setup facilitates the use of statsmodels' formula API to define and fit the logistic regression model succinctly. Mindful of potential overfitting and convergence issues, we select only a handful of top-ranked features out of the original 50, guided by their importance rankings determined by CatBoost in Subsection 1.5. Once the model has been fitted, we provide a comprehensive overview of the logistic regression results to facilitate interpretation and analysis.
Referring to the feature importance list of CatBoost, we exclude the diagnosis variables diag_x and medical_specialty due to their high cardinality, which may hinder the convergence of the logistic regression. We use only group_diag_1 and omit group_diag_2 and group_diag_3 for the sake of interpretability; a preliminary analysis suggests that including the latter two does not significantly improve model performance.
We include admission_source_id and num_medications but exclude admission_type_id and time_in_hospital, respectively, because of their strong correlations with the included variables, as verified in Part I. Furthermore, we include number_diagnoses in our model, as it ranks among the top 10 features in the XGBoost feature importance plot from Part I. Finally, we exclude payer_code, as it does not improve the model's performance.
Hence, a total of eight features are selected to construct our logistic regression model.
# Combine features and target variable into a single dataframe for formula-based modeling
Xy_pre_train = X_pre_train.copy()
Xy_pre_train['TARGET'] = y_pre_train
# Start timer to calculate the running time of training the logistic regression model
start_time = time.time()
# Define the logistic regression model using statsmodels' formula API and fit it to the training data
LR1 = smf.logit(formula="TARGET ~ discharge_disposition_id + group_diag_1 + number_inpatient \
+ admission_source_id + num_lab_procedures + age + \
num_medications + number_diagnoses",
data=Xy_pre_train).fit()
# Calculate and print the running time in seconds
elapsed_time_LR1 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_LR1:.2f}")
# Display a summary of the logistic regression model results
print(LR1.summary())
Optimization terminated successfully.
Current function value: 0.365714
Iterations 8
Elapsed time (sec): 1.27
Logit Regression Results
==============================================================================
Dep. Variable: TARGET No. Observations: 33425
Model: Logit Df Residuals: 33378
Method: MLE Df Model: 46
Date: Tue, 13 May 2025 Pseudo R-squ.: 0.05995
Time: 10:43:32 Log-Likelihood: -12224.
converged: True LL-Null: -13004.
Covariance Type: nonrobust LLR p-value: 1.047e-296
===================================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------
Intercept -4.0687 0.721 -5.644 0.000 -5.482 -2.656
discharge_disposition_id[T.2] 0.5727 0.096 5.947 0.000 0.384 0.761
discharge_disposition_id[T.3] 0.6122 0.052 11.748 0.000 0.510 0.714
discharge_disposition_id[T.4] 0.5608 0.171 3.273 0.001 0.225 0.897
discharge_disposition_id[T.5] 1.2109 0.112 10.854 0.000 0.992 1.430
discharge_disposition_id[T.6] 0.2691 0.055 4.869 0.000 0.161 0.377
discharge_disposition_id[T.7] 0.0973 0.225 0.432 0.666 -0.344 0.539
discharge_disposition_id[T.8] 0.4815 0.488 0.987 0.324 -0.475 1.437
discharge_disposition_id[T.15] 2.8252 0.501 5.643 0.000 1.844 3.806
discharge_disposition_id[T.18] 0.2945 0.088 3.352 0.001 0.122 0.467
discharge_disposition_id[T.22] 1.4496 0.084 17.294 0.000 1.285 1.614
discharge_disposition_id[T.23] -1.1954 0.460 -2.598 0.009 -2.097 -0.294
discharge_disposition_id[T.24] -0.5908 1.040 -0.568 0.570 -2.629 1.447
discharge_disposition_id[T.25] 0.0319 0.182 0.175 0.861 -0.324 0.388
discharge_disposition_id[T.28] 1.9884 0.289 6.892 0.000 1.423 2.554
group_diag_1[T.Diabetes] 0.0427 0.068 0.625 0.532 -0.091 0.177
group_diag_1[T.Digestive] -0.1658 0.066 -2.503 0.012 -0.296 -0.036
group_diag_1[T.Genitourinary] -0.1996 0.082 -2.435 0.015 -0.360 -0.039
group_diag_1[T.Injury] -0.2360 0.068 -3.470 0.001 -0.369 -0.103
group_diag_1[T.Musculoskeletal] -0.3221 0.080 -4.036 0.000 -0.478 -0.166
group_diag_1[T.Neoplasms] -0.1262 0.082 -1.539 0.124 -0.287 0.034
group_diag_1[T.Other] -0.1623 0.052 -3.122 0.002 -0.264 -0.060
group_diag_1[T.Respiratory] -0.3503 0.060 -5.814 0.000 -0.468 -0.232
group_diag_1[T.n.a.] 0.6518 0.909 0.717 0.473 -1.130 2.433
admission_source_id[T.2] -0.2494 0.159 -1.569 0.117 -0.561 0.062
admission_source_id[T.3] 0.5929 0.303 1.959 0.050 -0.000 1.186
admission_source_id[T.4] -0.3855 0.096 -4.036 0.000 -0.573 -0.198
admission_source_id[T.5] -0.5668 0.204 -2.777 0.005 -0.967 -0.167
admission_source_id[T.6] -0.2210 0.117 -1.891 0.059 -0.450 0.008
admission_source_id[T.7] 0.0393 0.042 0.944 0.345 -0.042 0.121
admission_source_id[T.8] -0.1709 1.097 -0.156 0.876 -2.321 1.979
admission_source_id[T.9] -0.2962 0.475 -0.623 0.533 -1.228 0.636
admission_source_id[T.17] 0.0189 0.073 0.260 0.795 -0.124 0.162
admission_source_id[T.20] 0.7139 0.306 2.334 0.020 0.114 1.313
age[T.[10-20)] 1.0586 0.754 1.403 0.160 -0.420 2.537
age[T.[20-30)] 1.3588 0.732 1.857 0.063 -0.075 2.793
age[T.[30-40)] 1.2057 0.725 1.663 0.096 -0.215 2.626
age[T.[40-50)] 1.2979 0.721 1.799 0.072 -0.116 2.712
age[T.[50-60)] 1.2224 0.721 1.696 0.090 -0.190 2.635
age[T.[60-70)] 1.3950 0.721 1.936 0.053 -0.018 2.808
age[T.[70-80)] 1.5193 0.721 2.108 0.035 0.107 2.932
age[T.[80-90)] 1.4326 0.721 1.986 0.047 0.019 2.846
age[T.[90-100)] 1.1406 0.728 1.567 0.117 -0.286 2.567
number_inpatient 0.5878 0.025 23.090 0.000 0.538 0.638
num_lab_procedures 0.0027 0.001 2.902 0.004 0.001 0.005
num_medications 0.0036 0.002 1.654 0.098 -0.001 0.008
number_diagnoses 0.0557 0.010 5.755 0.000 0.037 0.075
===================================================================================================
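Two quantities in the summary above can be checked by hand: the model degrees of freedom and the odds ratio implied by a coefficient. The sketch below counts the dummy terms listed in the table (the formula API drops one reference level per categorical feature) and exponentiates the coefficient of number_inpatient. The dictionary `dummy_terms` is only an illustrative bookkeeping device.

```python
import math

# Number of dummy terms listed in the summary for each categorical feature
dummy_terms = {
    "discharge_disposition_id": 14,
    "group_diag_1": 9,
    "admission_source_id": 10,
    "age": 9,
}
# number_inpatient, num_lab_procedures, num_medications, number_diagnoses
numeric_terms = 4

df_model = sum(dummy_terms.values()) + numeric_terms
n_obs = 33425
df_resid = n_obs - df_model - 1  # one further parameter for the intercept

# Odds ratio for one additional prior inpatient visit (coefficient 0.5878)
odds_ratio_inpatient = math.exp(0.5878)

print(df_model, df_resid)              # 46 33378
print(round(odds_ratio_inpatient, 2))  # 1.8
```

Under this model, each additional prior inpatient visit multiplies the odds of readmission by about 1.8, holding the other features fixed.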
Where feature measurement scale matters in our logistic regression:
Note that the selected features are either categorical or integer-valued. The formula API one-hot encodes categorical features automatically, and one-hot encoded features already take values in {0, 1}, so measurement scale is not an issue for them; standard-scaling them would only harm interpretability.
For integer features such as number_inpatient, num_lab_procedures, and num_medications, however, as for any numerical feature, scaling plays a crucial role in training linear models, especially if the value ranges differ widely across features. For example, while number_inpatient varies between 0 and 12, num_lab_procedures ranges from 0 to above 130.
Logistic regression is a linear model; without scaling, problems can arise in the following areas:
Model Convergence and Optimization
Logistic regression is fitted with iterative optimization techniques such as Newton-type methods or gradient descent. If features vary greatly in scale (e.g., one feature ranges over [0, 1], another over [0, 1000]), the optimization landscape becomes distorted, and convergence may be slower, unstable, or even fail completely with some solvers such as liblinear.
Regularization Sensitivity
If common regularization techniques such as L1 (Lasso) or L2 (Ridge) are used, scaling is crucial because regularization penalizes large coefficients. A feature measured on a larger scale needs a smaller coefficient for the same effect and is therefore penalized less, which biases the model toward it even if it is not more informative.
Coefficient Interpretability
Coefficients in logistic regression represent the change in log-odds per unit change in a feature. Without standardization, interpreting these coefficients can be misleading. For example, a coefficient of 1 on a feature ranging from [0, 1] may have a much greater impact than a coefficient of 0.001 on a feature ranging from [0, 1000], even though their raw values suggest otherwise. The difference in scale can obscure the true importance of the features.
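The interpretability point can be illustrated with a small simulation. The sketch below fits an unregularized logistic regression twice, once on raw and once on standardized features, using a hand-written Newton-Raphson routine. The helper `fit_logit` and the simulated features are assumptions made for this illustration; they are not the model or the data used above.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Unregularized logistic regression fitted by Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (y - p)                            # score vector
        hess = Xd.T @ (Xd * (p * (1.0 - p))[:, None])    # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 5000
x_small = rng.poisson(0.6, n).astype(float)        # small range, like a visit count
x_large = rng.normal(43.0, 20.0, n).clip(1, 130)   # wide range, like a lab-test count
true_logit = -2.5 + 0.5 * x_small + 0.003 * x_large
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X = np.column_stack([x_small, x_large])
mu, sd = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sd                                  # standardized features

beta_raw = fit_logit(X, y)
beta_std = fit_logit(Z, y)

# Predicted probabilities from both fits
p_raw = 1.0 / (1.0 + np.exp(-(beta_raw[0] + X @ beta_raw[1:])))
p_std = 1.0 / (1.0 + np.exp(-(beta_std[0] + Z @ beta_std[1:])))

print("Raw coefficients:         ", np.round(beta_raw[1:], 4))
print("Standardized coefficients:", np.round(beta_std[1:], 4))
print("Max. difference in predictions:", np.abs(p_raw - p_std).max())
```

Without regularization, standardization leaves the fitted probabilities unchanged and only rescales each slope by the standard deviation of its feature, which makes the coefficients comparable across features.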
Now that our logistic regression model is trained, we evaluate its performance on the validation set by calculating the AUC, which provides us with a single metric summarizing the model's ability to discriminate between the positive and negative classes.
# Evaluate the performance of the logistic regression model on the validation data
# and calculate the Area Under the Curve (AUC) for the ROC
auc_LR1 = roc_auc_score(y_pre_val, LR1.predict(X_pre_val))
# Print the validation AUC
print(f'The validation AUC of the logistic regression model is: {auc_LR1:.6f}')
The validation AUC of the logistic regression model is: 0.680391
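To make the meaning of this metric concrete: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following sketch computes this pairwise definition directly on invented toy scores; the helper `auc_pairwise` is hypothetical and only suitable for small samples.

```python
import numpy as np

def auc_pairwise(y_true, score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Invented toy labels and scores
y_toy = np.array([0, 0, 1, 1])
score_toy = np.array([0.10, 0.40, 0.35, 0.80])

auc_toy = auc_pairwise(y_toy, score_toy)
print(auc_toy)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, hence the value 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.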
2.3 Data imputation impact: Re-evaluating CatBoost with median replacement
At the beginning of this section, we proposed imputing missing numerical values using their median; this is in contrast to CatBoost's default behavior of replacing missing values with a minimal value below any observed non-missing value (see https://catboost.ai/en/docs/concepts/algorithm-missing-values-processing), which was implicitly applied in Section 1. As this default strategy may not be optimal for all features, we now examine whether median imputation can enhance the performance of our CatBoost model.
In our case, as mentioned in Section 1, the dataset contains no missing values (NA), so the imputation step left the data unchanged. We can therefore safely comment out the following code and retain this section for future use.
# # Start timer to calculate the running time of training the CatBoost model with median replacement
# start_time = time.time()
# # Initialize a CatBoostClassifier and fit it to the preprocessed training data
# CB2 = CatBoostClassifier(eval_metric='AUC', random_seed=RANDOM_SEED)
# CB2.fit(X_pre_train, y_train, cat_features=categorical_features, logging_level='Silent')
# # Calculate and print the running time in seconds
# elapsed_time_CB2 = time.time() - start_time
# print(f"Elapsed time (sec): {elapsed_time_CB2:.2f}")
# # Evaluate the performance of the CatBoost model with median replacement on the
# # validation data and calculate the Area Under the Curve (AUC) for the ROC
# auc_CB2 = roc_auc_score(y_val, CB2.predict_proba(X_pre_val)[:,1])
# print(f'The validation AUC of the CatBoost model with median replacement is: {auc_CB2:.6f}')
# print(f'The validation AUC of the CatBoost model with minimum replacement was: {auc_CB1:.6f}')
In Section 3, we conduct a focused Exploratory Data Analysis (EDA), analyzing the dataset's most significant categorical and numerical variables, and then explore feature engineering to conceive new attributes with the potential to elevate our models' predictive capabilities. Furthermore, we will concentrate on a selection of pivotal numerical features identified in Subsection 1.5, along with a handful of categorical features of particular interest.
3.1 Analysis of categorical features
First, we examine the key categorical variables
- group_diag_1,
- group_diag_2,
- group_diag_3,
- age,
- medical_specialty,
- payer_code, and
- discharge_disposition_id,
to gauge their influence on the risk of hospital readmission among diabetic patients. Our analysis involves:
- Creating stacked bar charts to contrast patient counts (with the y-axis set to a logarithmic scale) and percentage distributions across different categories with respect to the target outcome.
- Comparing the raw number of readmissions to the proportion within each category to uncover potential predictive patterns.
By examining both the count and percentage distributions side by side, we can better understand how each categorical feature correlates with the likelihood of a diabetic patient being readmitted in less than 30 days.
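The count and percentage views can be sketched on a toy table. The example below builds a crosstab and normalizes each row to percentages; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Invented toy data: one categorical feature and the binary target
toy = pd.DataFrame({
    "age": ["[70-80)", "[70-80)", "[70-80)", "[70-80)", "[20-30)", "[20-30)"],
    "TARGET": [0, 0, 0, 1, 0, 1],
})

# Counts per category and target value
ctab = pd.crosstab(toy["age"], toy["TARGET"])

# Row-wise normalization: each category sums to 100 %
ctab_pct = ctab.div(ctab.sum(axis=1), axis=0) * 100

print(ctab)
print(ctab_pct)
```

In this toy table, the category with fewer rows has the higher share of positive cases, which is why counts and percentages should be read together.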
# List of selected categorical features to analyze
selected_categorical_features = ['group_diag_1', 'group_diag_2', 'group_diag_3',
'age', 'medical_specialty', 'payer_code', 'discharge_disposition_id']
# Number of rows/columns for the subplot grid
n_cols = 2 # doubled to fit count and percentage plots side by side
n_rows = len(selected_categorical_features) # one row for each feature
# Set up the matplotlib figure
plt.figure(figsize=(n_cols * 9, n_rows * 5))
# Loop through the number of categorical features
for idx, feature in enumerate(selected_categorical_features):
    # Create a crosstab for stacked bar plot structure
    ctab = pd.crosstab(df_pre[feature], df_pre['TARGET'])
    # (1st column) add a new subplot iteratively for count values
    ax1 = plt.subplot(n_rows, n_cols, idx * n_cols + 1)
    # Create a stacked bar plot for count values
    ctab.plot(kind="bar", stacked=True, color=[COLOR_LIGHT, COLOR_DARK], edgecolor="none", ax=ax1)
    # Additional plot settings for count subplot
    ax1.set_title(f'Stacked bar plot of {feature}')
    ax1.set_xlabel(feature)
    ax1.set_ylabel('Count')
    ax1.set_yscale('log') # Set y-axis to log base 10
    ax1.legend(title='TARGET', loc='center left', bbox_to_anchor=(1, 0.5))
    plt.xticks(rotation=45, ha='right')
    # (2nd column) add a new subplot iteratively for percentage values
    ax2 = plt.subplot(n_rows, n_cols, idx * n_cols + 2)
    # Normalize the crosstab by row and multiply by 100 to convert to percentages
    ctab_normalized = ctab.div(ctab.sum(axis=1), axis=0) * 100
    # Create a stacked bar plot for percentage values
    ctab_normalized.plot(kind="bar", stacked=True, color=[COLOR_LIGHT, COLOR_DARK], edgecolor="none", ax=ax2)
    # Additional plot settings for percentage subplot
    ax2.set_title(f'Stacked bar plot of {feature} (in %)')
    ax2.set_xlabel(feature)
    ax2.set_ylabel('Percentage [%]')
    ax2.legend().remove()
    plt.xticks(rotation=45, ha='right')
    # Make yticks be in percentages for the percentage subplot
    ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:.0f}%".format(int(x))))
    ax2.set_ylim(0, 100)
# Preventing subplots from being too close to each other and display the final plot
plt.tight_layout()
plt.show()
Insights from the categorical feature analysis within the Diabetes 130-US Hospitals (1999-2008) dataset suggest:
- Diagnostic Groups (group_diag_1, group_diag_2, group_diag_3): While readmission rates (TARGET = 1) are relatively consistent across diagnostic categories, groups such as Neoplasms, Circulatory, and Respiratory exhibit slightly higher readmission rates. In contrast, Diabetes, Digestive, and Musculoskeletal diagnoses are associated with slightly lower rates of readmission.
- Age Groups (age): Patient distribution is heavily skewed toward older age groups. Readmission rates generally increase with age, peaking in the [70–80) and [80–90) brackets before declining in the oldest group.
- Medical Specialty (medical_specialty): Although a large portion of data falls under the missing category (?), readmission rates vary notably across specialties. Hematology/Oncology and Oncology show the highest readmission rates.
- Payer Code (payer_code): Patients covered under DM and MC plans exhibit higher readmission tendencies, whereas those under WC are associated with lower readmission rates.
- Discharge Disposition ID (discharge_disposition_id): Readmission rates differ significantly across discharge categories. Dispositions such as 15, 22, and 28 are linked to markedly higher readmission rates, while dispositions like 23 and 24 are associated with significantly lower rates.
As with any data-driven insights, statistical validation is advisable before applying these findings.
3.2 Analysis of numerical features
Next, we analyze the key numerical variables
- number_outpatient,
- number_inpatient,
- num_procedures,
- num_lab_procedures,
- num_medications,
- time_in_hospital,
- number_emergency, and
- number_diagnoses,
to detect patterns related to hospital readmission outcomes. Through Kernel Density Estimation (KDE) plots, we observe the density distribution of each feature for both readmitted and non-readmitted patients. By comparing these distributions, we can identify which numerical predictors might have significant predictive power, a critical step for effective feature engineering and subsequent model building in binary classification tasks.
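For intuition on what a KDE plot displays, the sketch below implements a one-dimensional Gaussian kernel density estimate by hand and applies it to two simulated classes. The helper `gaussian_kde_1d`, the fixed bandwidth of 0.5, and the Poisson samples are assumptions for illustration; seaborn chooses its bandwidth automatically, so its curves will differ in smoothness.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(42)
# Simulated count feature for two classes (not the real data)
x_class0 = rng.poisson(0.4, size=500).astype(float)
x_class1 = rng.poisson(1.2, size=500).astype(float)

step = 0.01
grid = np.arange(-5.0, 15.0, step)
dens0 = gaussian_kde_1d(x_class0, grid, bandwidth=0.5)
dens1 = gaussian_kde_1d(x_class1, grid, bandwidth=0.5)

# Each estimated density integrates to approximately one (Riemann sum)
area0 = dens0.sum() * step
area1 = dens1.sum() * step

# Estimated probability mass above 1.5 for each class
mass0 = dens0[grid > 1.5].sum() * step
mass1 = dens1[grid > 1.5].sum() * step

print(f"Areas: {area0:.3f}, {area1:.3f}")
print(f"Mass above 1.5: class 0 = {mass0:.3f}, class 1 = {mass1:.3f}")
```

Because both curves integrate to one, a KDE plot compares the shapes of the two class distributions and hides the class imbalance. For integer-valued counts, the smoothing also spreads mass between the integers, so the curves should be read qualitatively.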
# List of selected numerical features to analyze
selected_numerical_features = ['number_outpatient', 'number_inpatient', 'num_procedures', 'num_lab_procedures',
'num_medications', 'time_in_hospital', 'number_emergency', 'number_diagnoses']
# number of plots, set up the subplot grid
n_plots = len(selected_numerical_features)
n_cols = 2 # adjust the number of columns based on your preference
n_rows = (n_plots + n_cols - 1) // n_cols
# set up the matplotlib figure
plt.figure(figsize=(n_cols * 5, n_rows * 4))
# loop through the list of numerical features
for idx, feature in enumerate(selected_numerical_features):
    ax = plt.subplot(n_rows, n_cols, idx + 1)
    # Create a color mapping based on TARGET values assuming values are 0 and 1
    color_mapping = {0: COLOR_LIGHT, 1: COLOR_DARK}
    class_vals = df_pre['TARGET'].unique()
    for val in class_vals:
        subset = df_pre[df_pre['TARGET'] == val]
        # Draw the KDE for each class
        sns.kdeplot(subset[feature], ax=ax, label=f'TARGET {val}', color=color_mapping[val], linewidth=2.5)
    # additional plot settings
    ax.set_title(f'Distribution of {feature}')
    ax.set_xlabel(f'{feature}')
    ax.set_ylabel('Density')
    ax.legend(title='TARGET')
# Preventing subplots from being too close to each other and display the final plot
plt.tight_layout()
plt.show()
The analysis of numerical features within the dataset yields the following key insights:
- number_inpatient, number_emergency, and number_outpatient, which indicate prior hospital visits, are positively correlated with readmission risk, with number_inpatient being the strongest predictor. Patients with more frequent prior visits, particularly inpatient ones, are more likely to be readmitted.
- num_procedures, which represents the number of procedures (excluding lab tests) performed during the encounter, shows little variation between TARGET=0 and TARGET=1. However, TARGET=1 exhibits a slightly heavier right tail.
- num_lab_procedures and num_medications, indicating the number of lab tests and medications respectively, are slightly skewed toward readmitted patients, suggesting that higher counts may be associated with readmission.
- The distribution of time_in_hospital, representing the duration of hospital stay, is right-shifted for TARGET=1 compared to TARGET=0, indicating that longer stays are linked to a higher risk of readmission.
- number_diagnoses shows that a slightly higher number of diagnoses is associated with an increased likelihood of readmission.
In this section, we concentrate on two essential preprocessing steps: encoding to make categorical data machine-readable, essential for many machine learning algorithms, and scaling to normalize the values of numerical features, aiding in algorithm efficiency. We also tackle class imbalance by using subsampling methods, creating a more balanced training dataset and boosting computational effectiveness for our machine learning models.
6.1 Encoding categorical data and scaling numerical values
In general, machine learning models require input data to be in numerical format to perform mathematical computations. However, real-world datasets—such as the one considered here—frequently contain categorical or text data. Hence, encoding is needed to convert this non-numerical data into a numerical form that can be understood and processed by machine learning algorithms.
The two most commonly used encoding methods are the following:
Label-Encoding: Converts each value in a column to a number. Good for ordinal data, i.e., when categories have an inherent order.
One-Hot-Encoding: Creates a binary column for each category/label present in the column. This is useful for nominal variables, where the categories don't have any order or priority.
In the following, we will apply One-Hot-Encoding to our dataset in order to make it processable for machine learning methods such as Artificial Neural Nets or XGBoost.
# Output the dimensions of the dataset prior to dummy encoding
print("Shape of the dataset before dummy encoding: ", df_pre.shape)
# Perform dummy encoding on categorical features, removing the first level to avoid dummy variable trap
df = pd.get_dummies(df_pre, drop_first=True)
# Output the dimensions of the dataset after dummy encoding to show the change
print("Shape of the dataset after dummy encoding: ", df.shape)
Shape of the dataset before dummy encoding: (47751, 51)
Shape of the dataset after dummy encoding: (47751, 2249)
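The growth in the number of columns follows from the encoding rule: with drop_first=True, a categorical feature with k levels is replaced by k - 1 indicator columns, while numerical columns are kept as they are. A minimal sketch on an invented toy frame:

```python
import pandas as pd

# Invented toy data: one numerical and two categorical features
toy = pd.DataFrame({
    "number_inpatient": [0, 2, 1, 0],
    "age": ["[60-70)", "[70-80)", "[80-90)", "[70-80)"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# Three age levels -> 2 columns, two gender levels -> 1 column
encoded = pd.get_dummies(toy, drop_first=True)

print(toy.shape, "->", encoded.shape)
print(list(encoded.columns))
```

The first level of each feature (here `[60-70)` and `Female`) becomes the reference category and is represented by zeros in all of its indicator columns. High-cardinality features therefore dominate the width of the encoded dataset.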
# Change columns names (LightGBM does not support special JSON characters in feature names)
new_names = {col: re.sub(r'[^A-Za-z0-9_]+', '', col) for col in df.columns}
new_n_list = list(new_names.values())
new_names = {col: f'{new_col}_{i}' if new_col in new_n_list[:i] else new_col for i, (col, new_col) in enumerate(new_names.items())}
df = df.rename(columns=new_names)
df.head(1)
| time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | TARGET | race_AfricanAmerican | race_Asian | race_Caucasian | race_Hispanic | race_Other | gender_Male | age_1020 | age_2030 | age_3040 | age_4050 | age_5060 | age_6070 | age_7080 | age_8090 | age_90100 | weight_025 | weight_100125 | weight_125150 | weight_150175 | weight_2550 | weight_5075 | weight_75100 | admission_type_id_2 | admission_type_id_3 | admission_type_id_5 | admission_type_id_6 | admission_type_id_7 | admission_type_id_8 | discharge_disposition_id_2 | discharge_disposition_id_3 | discharge_disposition_id_4 | discharge_disposition_id_5 | discharge_disposition_id_6 | discharge_disposition_id_7 | discharge_disposition_id_8 | discharge_disposition_id_15 | discharge_disposition_id_18 | discharge_disposition_id_22 | discharge_disposition_id_23 | discharge_disposition_id_24 | discharge_disposition_id_25 | discharge_disposition_id_28 | admission_source_id_2 | admission_source_id_3 | admission_source_id_4 | admission_source_id_5 | admission_source_id_6 | admission_source_id_7 | admission_source_id_8 | admission_source_id_9 | admission_source_id_17 | admission_source_id_20 | payer_code_BC | payer_code_CH | payer_code_CM | payer_code_CP | payer_code_DM | payer_code_HM | payer_code_MC | payer_code_MD | payer_code_MP | payer_code_OG | payer_code_OT | payer_code_PO | payer_code_SI | payer_code_SP | payer_code_UN | payer_code_WC | medical_specialty_AnesthesiologyPediatric | medical_specialty_Cardiology | medical_specialty_EmergencyTrauma | medical_specialty_Endocrinology | medical_specialty_FamilyGeneralPractice | medical_specialty_Gastroenterology | medical_specialty_Gynecology | medical_specialty_Hematology | medical_specialty_HematologyOncology | medical_specialty_Hospitalist | medical_specialty_InfectiousDiseases | medical_specialty_InternalMedicine | medical_specialty_Nephrology | medical_specialty_Neurology | 
medical_specialty_ObstericsGynecologyGynecologicOnco | medical_specialty_Obstetrics | medical_specialty_ObstetricsandGynecology | medical_specialty_Oncology | medical_specialty_Ophthalmology | medical_specialty_Orthopedics | medical_specialty_OrthopedicsReconstructive | medical_specialty_Osteopath | medical_specialty_Otolaryngology | medical_specialty_Pediatrics | medical_specialty_PediatricsCriticalCare | medical_specialty_PediatricsEndocrinology | medical_specialty_PhysicalMedicineandRehabilitation | medical_specialty_Podiatry | medical_specialty_Psychiatry | medical_specialty_Psychology | medical_specialty_Pulmonology | medical_specialty_Radiologist | medical_specialty_Radiology | medical_specialty_Rheumatology | medical_specialty_Surgeon | medical_specialty_SurgeryCardiovascular | medical_specialty_SurgeryCardiovascularThoracic | medical_specialty_SurgeryGeneral | medical_specialty_SurgeryNeuro | medical_specialty_SurgeryPlastic | medical_specialty_SurgeryThoracic | medical_specialty_SurgeryVascular | medical_specialty_SurgicalSpecialty | medical_specialty_Urology | diag_1_11 | diag_1_110 | diag_1_112 | diag_1_115 | diag_1_117 | diag_1_133 | diag_1_135 | diag_1_136 | diag_1_141 | diag_1_142 | diag_1_143 | diag_1_146 | diag_1_147 | diag_1_149 | diag_1_150 | diag_1_151 | diag_1_152 | diag_1_153 | diag_1_154 | diag_1_155 | diag_1_156 | diag_1_157 | diag_1_158 | diag_1_160 | diag_1_161 | diag_1_162 | diag_1_163 | diag_1_164 | diag_1_170 | diag_1_171 | diag_1_172 | diag_1_173 | diag_1_174 | diag_1_175 | diag_1_179 | diag_1_180 | diag_1_182 | diag_1_183 | diag_1_184 | diag_1_185 | diag_1_187 | diag_1_188 | diag_1_189 | diag_1_191 | diag_1_192 | diag_1_193 | diag_1_194 | diag_1_195 | diag_1_196 | diag_1_197 | diag_1_198 | diag_1_199 | diag_1_200 | diag_1_201 | diag_1_202 | diag_1_203 | diag_1_204 | diag_1_205 | diag_1_207 | diag_1_208 | diag_1_210 | diag_1_211 | diag_1_212 | diag_1_214 | diag_1_215 | diag_1_216 | diag_1_218 | diag_1_219 | diag_1_220 | diag_1_223 | 
diag_1_225 | diag_1_226 | diag_1_227 | diag_1_228 | diag_1_229 | diag_1_230 | diag_1_233 | diag_1_235 | diag_1_236 | diag_1_237 | diag_1_238 | diag_1_239 | diag_1_240 | diag_1_241 | diag_1_242 | diag_1_244 | diag_1_245 | diag_1_246 | diag_1_250 | diag_1_25001 | diag_1_25002 | diag_1_25003 | diag_1_2501 | diag_1_25011 | diag_1_25012 | diag_1_25013 | diag_1_2502 | diag_1_25021 | diag_1_25022 | diag_1_25023 | diag_1_2503 | diag_1_25031 | diag_1_25032 | diag_1_25033 | diag_1_2504 | diag_1_25041 | diag_1_25042 | diag_1_25043 | diag_1_2505 | diag_1_25051 | diag_1_25053 | diag_1_2506 | diag_1_2507 | diag_1_2508 | diag_1_25081 | diag_1_25082 | diag_1_25083 | diag_1_2509 | diag_1_25091 | diag_1_25092 | diag_1_25093 | diag_1_251 | diag_1_252 | diag_1_253 | diag_1_255 | diag_1_261 | diag_1_262 | diag_1_263 | diag_1_266 | diag_1_272 | diag_1_273 | diag_1_274 | diag_1_275 | diag_1_276 | diag_1_277 | diag_1_278 | diag_1_280 | diag_1_281 | diag_1_282 | diag_1_283 | diag_1_284 | diag_1_285 | diag_1_286 | diag_1_287 | diag_1_288 | diag_1_289 | diag_1_290 | diag_1_291 | diag_1_292 | diag_1_293 | diag_1_294 | diag_1_295 | diag_1_296 | diag_1_297 | diag_1_298 | diag_1_299 | diag_1_3 | diag_1_300 | diag_1_301 | diag_1_303 | diag_1_304 | diag_1_305 | diag_1_306 | diag_1_307 | diag_1_308 | diag_1_309 | diag_1_31 | diag_1_310 | diag_1_311 | diag_1_312 | diag_1_314 | diag_1_318 | diag_1_320 | diag_1_322 | diag_1_323 | diag_1_325 | diag_1_327 | diag_1_331 | diag_1_332 | diag_1_333 | diag_1_334 | diag_1_335 | diag_1_336 | diag_1_337 | diag_1_338 | diag_1_34 | diag_1_340 | diag_1_341 | diag_1_342 | diag_1_344 | diag_1_345 | diag_1_346 | diag_1_347 | diag_1_348 | diag_1_349 | diag_1_35 | diag_1_350 | diag_1_351 | diag_1_352 | diag_1_353 | diag_1_354 | diag_1_355 | diag_1_356 | diag_1_357 | diag_1_358 | diag_1_359 | diag_1_36 | diag_1_360 | diag_1_361 | diag_1_362 | diag_1_363 | diag_1_368 | diag_1_370 | diag_1_374 | diag_1_376 | diag_1_377 | diag_1_378 | diag_1_379 | diag_1_38 | diag_1_380 | 
diag_1_381 | diag_1_382 | diag_1_383 | diag_1_384 | diag_1_385 | diag_1_386 | diag_1_388 | diag_1_39 | diag_1_391 | diag_1_394 | diag_1_395 | diag_1_396 | diag_1_397 | diag_1_398 | diag_1_401 | diag_1_402 | diag_1_403 | diag_1_404 | diag_1_405 | diag_1_41 | diag_1_410 | diag_1_411 | diag_1_412 | diag_1_413 | diag_1_414 | diag_1_415 | diag_1_416 | diag_1_417 | diag_1_42 | diag_1_420 | diag_1_421 | diag_1_422 | diag_1_423 | diag_1_424 | diag_1_425 | diag_1_426 | diag_1_427 | diag_1_428 | diag_1_429 | diag_1_430 | diag_1_431 | diag_1_432 | diag_1_433 | diag_1_434 | diag_1_435 | diag_1_436 | diag_1_437 | diag_1_438 | diag_1_440 | diag_1_441 | diag_1_442 | diag_1_443 | diag_1_444 | diag_1_445 | diag_1_446 | diag_1_447 | diag_1_448 | diag_1_451 | diag_1_452 | diag_1_453 | diag_1_454 | diag_1_455 | diag_1_456 | diag_1_457 | diag_1_458 | diag_1_459 | diag_1_461 | diag_1_462 | diag_1_463 | diag_1_464 | diag_1_465 | diag_1_466 | diag_1_47 | diag_1_470 | diag_1_473 | diag_1_474 | diag_1_475 | diag_1_477 | diag_1_478 | diag_1_48 | diag_1_480 | diag_1_481 | diag_1_482 | diag_1_483 | diag_1_485 | diag_1_486 | diag_1_487 | diag_1_49 | diag_1_490 | diag_1_491 | diag_1_492 | diag_1_493 | diag_1_494 | diag_1_495 | diag_1_496 | diag_1_5 | diag_1_501 | diag_1_507 | diag_1_508 | diag_1_510 | diag_1_511 | diag_1_512 | diag_1_513 | diag_1_514 | diag_1_515 | diag_1_516 | diag_1_518 | diag_1_519 | diag_1_52 | diag_1_521 | diag_1_522 | diag_1_524 | diag_1_526 | diag_1_527 | diag_1_528 | diag_1_529 | diag_1_53 | diag_1_530 | diag_1_531 | diag_1_532 | diag_1_533 | diag_1_534 | diag_1_535 | diag_1_536 | diag_1_537 | diag_1_54 | diag_1_540 | diag_1_541 | diag_1_542 | diag_1_550 | diag_1_551 | diag_1_552 | diag_1_553 | diag_1_555 | diag_1_556 | diag_1_557 | diag_1_558 | diag_1_560 | diag_1_562 | diag_1_564 | diag_1_565 | diag_1_566 | diag_1_567 | diag_1_568 | diag_1_569 | diag_1_57 | diag_1_570 | diag_1_571 | diag_1_572 | diag_1_573 | diag_1_574 | diag_1_575 | diag_1_576 | diag_1_577 | 
diag_1_578 | diag_1_579 | diag_1_580 | diag_1_581 | diag_1_582 | diag_1_584 | diag_1_585 | diag_1_586 | diag_1_588 | diag_1_590 | diag_1_591 | diag_1_592 | diag_1_593 | diag_1_594 | diag_1_595 | diag_1_596 | diag_1_598 | diag_1_599 | diag_1_600 | diag_1_601 | diag_1_602 | diag_1_603 | diag_1_604 | diag_1_605 | diag_1_607 | diag_1_608 | diag_1_61 | diag_1_610 | diag_1_611 | diag_1_614 | diag_1_616 | diag_1_617 | diag_1_618 | diag_1_619 | diag_1_620 | diag_1_621 | diag_1_622 | diag_1_623 | diag_1_625 | diag_1_626 | diag_1_627 | diag_1_632 | diag_1_633 | diag_1_634 | diag_1_637 | diag_1_641 | diag_1_642 | diag_1_643 | diag_1_644 | diag_1_645 | diag_1_646 | diag_1_647 | diag_1_648 | diag_1_649 | diag_1_652 | diag_1_653 | diag_1_654 | diag_1_655 | diag_1_656 | diag_1_657 | diag_1_658 | diag_1_659 | diag_1_66 | diag_1_660 | diag_1_661 | diag_1_663 | diag_1_664 | diag_1_665 | diag_1_669 | diag_1_671 | diag_1_674 | diag_1_680 | diag_1_681 | diag_1_682 | diag_1_683 | diag_1_684 | diag_1_685 | diag_1_686 | diag_1_690 | diag_1_691 | diag_1_692 | diag_1_693 | diag_1_694 | diag_1_695 | diag_1_696 | diag_1_7 | diag_1_70 | diag_1_705 | diag_1_706 | diag_1_707 | diag_1_708 | diag_1_709 | diag_1_710 | diag_1_711 | diag_1_714 | diag_1_715 | diag_1_716 | diag_1_717 | diag_1_718 | diag_1_719 | diag_1_720 | diag_1_721 | diag_1_722 | diag_1_723 | diag_1_724 | diag_1_725 | diag_1_726 | diag_1_727 | diag_1_728 | diag_1_729 | diag_1_730 | diag_1_732 | diag_1_733 | diag_1_734 | diag_1_735 | diag_1_736 | diag_1_737 | diag_1_738 | diag_1_745 | diag_1_746 | diag_1_747 | diag_1_75 | diag_1_751 | diag_1_753 | diag_1_756 | diag_1_759 | diag_1_78 | diag_1_780 | diag_1_781 | diag_1_782 | diag_1_783 | diag_1_784 | diag_1_785 | diag_1_786 | diag_1_787 | diag_1_788 | diag_1_789 | diag_1_79 | diag_1_790 | diag_1_791 | diag_1_792 | diag_1_793 | diag_1_794 | diag_1_795 | diag_1_796 | diag_1_797 | diag_1_799 | diag_1_8 | diag_1_800 | diag_1_801 | diag_1_802 | diag_1_805 | diag_1_806 | diag_1_807 | 
diag_1_808 | diag_1_810 | diag_1_812 | diag_1_813 | diag_1_814 | diag_1_815 | diag_1_816 | diag_1_817 | diag_1_82 | diag_1_820 | diag_1_821 | diag_1_822 | diag_1_823 | diag_1_824 | diag_1_825 | diag_1_826 | diag_1_831 | diag_1_832 | diag_1_833 | diag_1_834 | diag_1_835 | diag_1_836 | diag_1_84 | diag_1_840 | diag_1_842 | diag_1_843 | diag_1_844 | diag_1_845 | diag_1_846 | diag_1_847 | diag_1_848 | diag_1_850 | diag_1_851 | diag_1_852 | diag_1_853 | diag_1_854 | diag_1_860 | diag_1_861 | diag_1_862 | diag_1_863 | diag_1_864 | diag_1_865 | diag_1_866 | diag_1_867 | diag_1_868 | diag_1_870 | diag_1_871 | diag_1_873 | diag_1_875 | diag_1_879 | diag_1_88 | diag_1_880 | diag_1_881 | diag_1_882 | diag_1_883 | diag_1_885 | diag_1_886 | diag_1_890 | diag_1_891 | diag_1_892 | diag_1_893 | diag_1_897 | diag_1_9 | diag_1_903 | diag_1_904 | diag_1_906 | diag_1_911 | diag_1_913 | diag_1_914 | diag_1_915 | diag_1_916 | diag_1_920 | diag_1_921 | diag_1_922 | diag_1_923 | diag_1_924 | diag_1_928 | diag_1_933 | diag_1_934 | diag_1_935 | diag_1_936 | diag_1_939 | diag_1_94 | diag_1_941 | diag_1_942 | diag_1_945 | diag_1_952 | diag_1_955 | diag_1_957 | diag_1_958 | diag_1_959 | diag_1_962 | diag_1_963 | diag_1_964 | diag_1_965 | diag_1_966 | diag_1_967 | diag_1_968 | diag_1_969 | diag_1_970 | diag_1_972 | diag_1_976 | diag_1_977 | diag_1_98 | diag_1_982 | diag_1_983 | diag_1_986 | diag_1_987 | diag_1_989 | diag_1_991 | diag_1_992 | diag_1_994 | diag_1_995 | diag_1_996 | diag_1_997 | diag_1_998 | diag_1_999 | diag_1_ | diag_1_E909 | diag_1_V26 | diag_1_V45 | diag_1_V51 | diag_1_V53 | diag_1_V54 | diag_1_V55 | diag_1_V56 | diag_1_V57 | diag_1_V58 | diag_1_V60 | diag_1_V63 | diag_1_V70 | diag_1_V71 | diag_2_110 | diag_2_111 | diag_2_112 | diag_2_114 | diag_2_117 | diag_2_123 | diag_2_131 | diag_2_135 | diag_2_136 | diag_2_137 | diag_2_138 | diag_2_140 | diag_2_145 | diag_2_150 | diag_2_151 | diag_2_152 | diag_2_153 | diag_2_154 | diag_2_155 | diag_2_156 | diag_2_157 | diag_2_162 | 
diag_2_163 | diag_2_164 | diag_2_172 | diag_2_173 | diag_2_174 | diag_2_179 | diag_2_180 | diag_2_183 | diag_2_185 | diag_2_186 | diag_2_188 | diag_2_189 | diag_2_191 | diag_2_192 | diag_2_193 | diag_2_196 | diag_2_197 | diag_2_198 | diag_2_199 | diag_2_200 | diag_2_201 | diag_2_202 | diag_2_203 | diag_2_204 | diag_2_205 | diag_2_208 | diag_2_211 | diag_2_212 | diag_2_214 | diag_2_215 | diag_2_217 | diag_2_218 | diag_2_220 | diag_2_225 | diag_2_226 | diag_2_227 | diag_2_228 | diag_2_233 | diag_2_235 | diag_2_238 | diag_2_239 | diag_2_240 | diag_2_241 | diag_2_242 | diag_2_244 | diag_2_245 | diag_2_246 | diag_2_250 | diag_2_25001 | diag_2_25002 | diag_2_25003 | diag_2_2501 | diag_2_25011 | diag_2_25012 | diag_2_25013 | diag_2_2502 | diag_2_25022 | diag_2_25023 | diag_2_25032 | diag_2_25033 | diag_2_2504 | diag_2_25041 | diag_2_25042 | diag_2_25043 | diag_2_2505 | diag_2_25051 | diag_2_25052 | diag_2_25053 | diag_2_2506 | diag_2_2507 | diag_2_2508 | diag_2_25081 | diag_2_25082 | diag_2_25083 | diag_2_2509 | diag_2_25091 | diag_2_25092 | diag_2_25093 | diag_2_251 | diag_2_252 | diag_2_253 | diag_2_255 | diag_2_258 | diag_2_259 | diag_2_260 | diag_2_261 | diag_2_262 | diag_2_263 | diag_2_266 | diag_2_268 | diag_2_269 | diag_2_27 | diag_2_271 | diag_2_272 | diag_2_273 | diag_2_274 | diag_2_275 | diag_2_276 | diag_2_277 | diag_2_278 | diag_2_279 | diag_2_280 | diag_2_281 | diag_2_282 | diag_2_283 | diag_2_284 | diag_2_285 | diag_2_286 | diag_2_287 | diag_2_288 | diag_2_289 | diag_2_290 | diag_2_291 | diag_2_292 | diag_2_293 | diag_2_294 | diag_2_295 | diag_2_296 | diag_2_297 | diag_2_298 | diag_2_299 | diag_2_300 | diag_2_301 | diag_2_302 | diag_2_303 | diag_2_304 | diag_2_305 | diag_2_306 | diag_2_307 | diag_2_308 | diag_2_309 | diag_2_31 | diag_2_310 | diag_2_311 | diag_2_312 | diag_2_314 | diag_2_316 | diag_2_317 | diag_2_319 | diag_2_320 | diag_2_322 | diag_2_323 | diag_2_324 | diag_2_327 | diag_2_331 | diag_2_332 | diag_2_333 | diag_2_335 | diag_2_336 | diag_2_337 | 
diag_2_338 | diag_2_34 | diag_2_340 | diag_2_342 | diag_2_343 | diag_2_344 | diag_2_345 | diag_2_346 | diag_2_347 | diag_2_348 | diag_2_349 | diag_2_35 | diag_2_351 | diag_2_352 | diag_2_353 | diag_2_354 | diag_2_355 | diag_2_356 | diag_2_357 | diag_2_358 | diag_2_359 | diag_2_360 | diag_2_362 | diag_2_365 | diag_2_368 | diag_2_369 | diag_2_372 | diag_2_373 | diag_2_374 | diag_2_376 | diag_2_377 | diag_2_378 | diag_2_379 | diag_2_38 | diag_2_380 | diag_2_381 | diag_2_382 | diag_2_383 | diag_2_386 | diag_2_389 | diag_2_394 | diag_2_395 | diag_2_396 | diag_2_397 | diag_2_398 | diag_2_40 | diag_2_401 | diag_2_402 | diag_2_403 | diag_2_404 | diag_2_405 | diag_2_41 | diag_2_410 | diag_2_411 | diag_2_412 | diag_2_413 | diag_2_414 | diag_2_415 | diag_2_416 | diag_2_42 | diag_2_420 | diag_2_421 | diag_2_422 | diag_2_423 | diag_2_424 | diag_2_425 | diag_2_426 | diag_2_427 | diag_2_428 | diag_2_429 | diag_2_430 | diag_2_431 | diag_2_432 | diag_2_433 | diag_2_434 | diag_2_435 | diag_2_436 | diag_2_437 | diag_2_438 | diag_2_440 | diag_2_441 | diag_2_442 | diag_2_443 | diag_2_444 | diag_2_446 | diag_2_447 | diag_2_451 | diag_2_452 | diag_2_453 | diag_2_454 | diag_2_455 | diag_2_456 | diag_2_457 | diag_2_458 | diag_2_459 | diag_2_46 | diag_2_460 | diag_2_461 | diag_2_462 | diag_2_463 | diag_2_464 | diag_2_465 | diag_2_466 | diag_2_470 | diag_2_472 | diag_2_473 | diag_2_474 | diag_2_477 | diag_2_478 | diag_2_480 | diag_2_481 | diag_2_482 | diag_2_484 | diag_2_485 | diag_2_486 | diag_2_487 | diag_2_490 | diag_2_491 | diag_2_492 | diag_2_493 | diag_2_494 | diag_2_495 | diag_2_496 | diag_2_500 | diag_2_501 | diag_2_507 | diag_2_508 | diag_2_510 | diag_2_511 | diag_2_512 | diag_2_514 | diag_2_515 | diag_2_516 | diag_2_517 | diag_2_518 | diag_2_519 | diag_2_520 | diag_2_521 | diag_2_522 | diag_2_523 | diag_2_524 | diag_2_527 | diag_2_528 | diag_2_529 | diag_2_53 | diag_2_530 | diag_2_531 | diag_2_532 | diag_2_533 | diag_2_534 | diag_2_535 | diag_2_536 | diag_2_537 | diag_2_54 | 
diag_2_540 | diag_2_542 | diag_2_543 | diag_2_550 | diag_2_552 | diag_2_553 | diag_2_555 | diag_2_556 | diag_2_557 | diag_2_558 | diag_2_560 | diag_2_562 | diag_2_564 | diag_2_565 | diag_2_566 | diag_2_567 | diag_2_568 | diag_2_569 | diag_2_570 | diag_2_571 | diag_2_572 | diag_2_573 | diag_2_574 | diag_2_575 | diag_2_576 | diag_2_577 | diag_2_578 | diag_2_579 | diag_2_580 | diag_2_581 | diag_2_583 | diag_2_584 | diag_2_585 | diag_2_586 | diag_2_588 | diag_2_590 | diag_2_591 | diag_2_592 | diag_2_593 | diag_2_594 | diag_2_595 | diag_2_596 | diag_2_598 | diag_2_599 | diag_2_600 | diag_2_601 | diag_2_602 | diag_2_603 | diag_2_604 | diag_2_605 | diag_2_607 | diag_2_608 | diag_2_611 | diag_2_614 | diag_2_615 | diag_2_616 | diag_2_617 | diag_2_618 | diag_2_619 | diag_2_620 | diag_2_621 | diag_2_622 | diag_2_623 | diag_2_625 | diag_2_626 | diag_2_627 | diag_2_634 | diag_2_641 | diag_2_642 | diag_2_644 | diag_2_645 | diag_2_646 | diag_2_647 | diag_2_648 | diag_2_649 | diag_2_652 | diag_2_654 | diag_2_656 | diag_2_658 | diag_2_659 | diag_2_661 | diag_2_663 | diag_2_664 | diag_2_665 | diag_2_670 | diag_2_674 | diag_2_680 | diag_2_681 | diag_2_682 | diag_2_683 | diag_2_686 | diag_2_691 | diag_2_692 | diag_2_693 | diag_2_696 | diag_2_698 | diag_2_70 | diag_2_701 | diag_2_702 | diag_2_703 | diag_2_704 | diag_2_705 | diag_2_706 | diag_2_707 | diag_2_709 | diag_2_710 | diag_2_711 | diag_2_712 | diag_2_713 | diag_2_714 | diag_2_715 | diag_2_716 | diag_2_717 | diag_2_718 | diag_2_719 | diag_2_721 | diag_2_722 | diag_2_723 | diag_2_724 | diag_2_725 | diag_2_726 | diag_2_727 | diag_2_728 | diag_2_729 | diag_2_730 | diag_2_731 | diag_2_733 | diag_2_734 | diag_2_736 | diag_2_737 | diag_2_738 | diag_2_741 | diag_2_742 | diag_2_745 | diag_2_746 | diag_2_747 | diag_2_748 | diag_2_750 | diag_2_751 | diag_2_752 | diag_2_753 | diag_2_755 | diag_2_756 | diag_2_758 | diag_2_759 | diag_2_780 | diag_2_781 | diag_2_782 | diag_2_783 | diag_2_784 | diag_2_785 | diag_2_786 | diag_2_787 | diag_2_788 
| diag_2_789 | diag_2_79 | diag_2_790 | diag_2_791 | diag_2_792 | diag_2_793 | diag_2_794 | diag_2_795 | diag_2_796 | diag_2_797 | diag_2_799 | diag_2_8 | diag_2_800 | diag_2_801 | diag_2_802 | diag_2_805 | diag_2_807 | diag_2_808 | diag_2_810 | diag_2_812 | diag_2_813 | diag_2_814 | diag_2_815 | diag_2_816 | diag_2_820 | diag_2_821 | diag_2_822 | diag_2_823 | diag_2_824 | diag_2_825 | diag_2_826 | diag_2_831 | diag_2_832 | diag_2_833 | diag_2_836 | diag_2_837 | diag_2_840 | diag_2_842 | diag_2_843 | diag_2_844 | diag_2_845 | diag_2_847 | diag_2_850 | diag_2_851 | diag_2_852 | diag_2_853 | diag_2_860 | diag_2_861 | diag_2_863 | diag_2_864 | diag_2_865 | diag_2_866 | diag_2_867 | diag_2_868 | diag_2_869 | diag_2_870 | diag_2_871 | diag_2_872 | diag_2_873 | diag_2_88 | diag_2_880 | diag_2_881 | diag_2_882 | diag_2_883 | diag_2_891 | diag_2_892 | diag_2_9 | diag_2_905 | diag_2_906 | diag_2_907 | diag_2_909 | diag_2_910 | diag_2_911 | diag_2_913 | diag_2_915 | diag_2_916 | diag_2_918 | diag_2_919 | diag_2_920 | diag_2_921 | diag_2_922 | diag_2_923 | diag_2_924 | diag_2_927 | diag_2_933 | diag_2_934 | diag_2_94 | diag_2_942 | diag_2_944 | diag_2_945 | diag_2_947 | diag_2_948 | diag_2_952 | diag_2_953 | diag_2_955 | diag_2_958 | diag_2_959 | diag_2_962 | diag_2_965 | diag_2_967 | diag_2_969 | diag_2_972 | diag_2_980 | diag_2_989 | diag_2_99 | diag_2_991 | diag_2_992 | diag_2_995 | diag_2_996 | diag_2_997 | diag_2_998 | diag_2_999 | diag_2_ | diag_2_E812 | diag_2_E814 | diag_2_E816 | diag_2_E819 | diag_2_E821 | diag_2_E826 | diag_2_E849 | diag_2_E850 | diag_2_E858 | diag_2_E868 | diag_2_E870 | diag_2_E878 | diag_2_E879 | diag_2_E880 | diag_2_E881 | diag_2_E882 | diag_2_E883 | diag_2_E884 | diag_2_E885 | diag_2_E887 | diag_2_E888 | diag_2_E890 | diag_2_E905 | diag_2_E906 | diag_2_E916 | diag_2_E917 | diag_2_E918 | diag_2_E919 | diag_2_E924 | diag_2_E927 | diag_2_E928 | diag_2_E930 | diag_2_E931 | diag_2_E932 | diag_2_E933 | diag_2_E934 | diag_2_E935 | diag_2_E936 | 
diag_2_E937 | diag_2_E938 | diag_2_E939 | diag_2_E942 | diag_2_E944 | diag_2_E947 | diag_2_E950 | diag_2_E965 | diag_2_E968 | diag_2_E980 | diag_2_V02 | diag_2_V03 | diag_2_V08 | diag_2_V09 | diag_2_V10 | diag_2_V11 | diag_2_V12 | diag_2_V13 | diag_2_V14 | diag_2_V15 | diag_2_V16 | diag_2_V17 | diag_2_V18 | diag_2_V23 | diag_2_V25 | diag_2_V42 | diag_2_V43 | diag_2_V44 | diag_2_V45 | diag_2_V46 | diag_2_V49 | diag_2_V50 | diag_2_V53 | diag_2_V54 | diag_2_V57 | diag_2_V58 | diag_2_V60 | diag_2_V61 | diag_2_V62 | diag_2_V63 | diag_2_V64 | diag_2_V65 | diag_2_V70 | diag_2_V72 | diag_2_V85 | diag_2_V86 | diag_3_112 | diag_3_115 | diag_3_117 | diag_3_122 | diag_3_123 | diag_3_131 | diag_3_135 | diag_3_136 | diag_3_138 | diag_3_139 | diag_3_14 | diag_3_141 | diag_3_150 | diag_3_151 | diag_3_152 | diag_3_153 | diag_3_154 | diag_3_155 | diag_3_156 | diag_3_157 | diag_3_158 | diag_3_161 | diag_3_162 | diag_3_163 | diag_3_17 | diag_3_170 | diag_3_171 | diag_3_172 | diag_3_173 | diag_3_174 | diag_3_179 | diag_3_180 | diag_3_182 | diag_3_183 | diag_3_185 | diag_3_188 | diag_3_189 | diag_3_191 | diag_3_192 | diag_3_193 | diag_3_196 | diag_3_197 | diag_3_198 | diag_3_199 | diag_3_200 | diag_3_201 | diag_3_202 | diag_3_203 | diag_3_204 | diag_3_205 | diag_3_208 | diag_3_211 | diag_3_214 | diag_3_217 | diag_3_218 | diag_3_220 | diag_3_223 | diag_3_225 | diag_3_227 | diag_3_228 | diag_3_233 | diag_3_235 | diag_3_238 | diag_3_239 | diag_3_240 | diag_3_241 | diag_3_242 | diag_3_244 | diag_3_245 | diag_3_246 | diag_3_250 | diag_3_25001 | diag_3_25002 | diag_3_25003 | diag_3_2501 | diag_3_25011 | diag_3_25012 | diag_3_25013 | diag_3_2502 | diag_3_25022 | diag_3_2503 | diag_3_25031 | diag_3_2504 | diag_3_25041 | diag_3_25042 | diag_3_25043 | diag_3_2505 | diag_3_25051 | diag_3_25052 | diag_3_25053 | diag_3_2506 | diag_3_2507 | diag_3_2508 | diag_3_25081 | diag_3_25082 | diag_3_25083 | diag_3_2509 | diag_3_25091 | diag_3_25092 | diag_3_25093 | diag_3_251 | diag_3_252 | diag_3_253 | 
diag_3_255 | diag_3_256 | diag_3_258 | diag_3_259 | diag_3_261 | diag_3_262 | diag_3_263 | diag_3_265 | diag_3_266 | diag_3_268 | diag_3_27 | diag_3_270 | diag_3_271 | diag_3_272 | diag_3_273 | diag_3_274 | diag_3_275 | diag_3_276 | diag_3_277 | diag_3_278 | diag_3_279 | diag_3_280 | diag_3_281 | diag_3_282 | diag_3_283 | diag_3_284 | diag_3_285 | diag_3_286 | diag_3_287 | diag_3_288 | diag_3_289 | diag_3_290 | diag_3_291 | diag_3_292 | diag_3_293 | diag_3_294 | diag_3_295 | diag_3_296 | diag_3_297 | diag_3_298 | diag_3_299 | diag_3_300 | diag_3_301 | diag_3_303 | diag_3_304 | diag_3_305 | diag_3_306 | diag_3_307 | diag_3_309 | diag_3_310 | diag_3_311 | diag_3_312 | diag_3_313 | diag_3_314 | diag_3_315 | diag_3_317 | diag_3_318 | diag_3_319 | diag_3_323 | diag_3_327 | diag_3_331 | diag_3_332 | diag_3_333 | diag_3_334 | diag_3_335 | diag_3_336 | diag_3_337 | diag_3_338 | diag_3_34 | diag_3_340 | diag_3_341 | diag_3_342 | diag_3_343 | diag_3_344 | diag_3_345 | diag_3_346 | diag_3_347 | diag_3_348 | diag_3_349 | diag_3_35 | diag_3_350 | diag_3_351 | diag_3_353 | diag_3_354 | diag_3_355 | diag_3_356 | diag_3_357 | diag_3_358 | diag_3_359 | diag_3_360 | diag_3_361 | diag_3_362 | diag_3_365 | diag_3_36544 | diag_3_366 | diag_3_368 | diag_3_369 | diag_3_370 | diag_3_372 | diag_3_373 | diag_3_374 | diag_3_376 | diag_3_377 | diag_3_378 | diag_3_379 | diag_3_38 | diag_3_380 | diag_3_381 | diag_3_382 | diag_3_383 | diag_3_384 | diag_3_385 | diag_3_386 | diag_3_387 | diag_3_388 | diag_3_389 | diag_3_391 | diag_3_394 | diag_3_395 | diag_3_396 | diag_3_397 | diag_3_398 | diag_3_401 | diag_3_402 | diag_3_403 | diag_3_404 | diag_3_405 | diag_3_41 | diag_3_410 | diag_3_411 | diag_3_412 | diag_3_413 | diag_3_414 | diag_3_415 | diag_3_416 | diag_3_417 | diag_3_42 | diag_3_420 | diag_3_421 | diag_3_423 | diag_3_424 | diag_3_425 | diag_3_426 | diag_3_427 | diag_3_428 | diag_3_429 | diag_3_430 | diag_3_431 | diag_3_432 | diag_3_433 | diag_3_434 | diag_3_435 | diag_3_436 | diag_3_437 | 
diag_3_438 | diag_3_440 | diag_3_441 | diag_3_442 | diag_3_443 | diag_3_444 | diag_3_446 | diag_3_447 | diag_3_451 | diag_3_452 | diag_3_453 | diag_3_454 | diag_3_455 | diag_3_456 | diag_3_457 | diag_3_458 | diag_3_459 | diag_3_460 | diag_3_461 | diag_3_462 | diag_3_463 | diag_3_464 | diag_3_465 | diag_3_466 | diag_3_47 | diag_3_470 | diag_3_472 | diag_3_473 | diag_3_477 | diag_3_478 | diag_3_480 | diag_3_482 | diag_3_486 | diag_3_487 | diag_3_49 | diag_3_490 | diag_3_491 | diag_3_492 | diag_3_493 | diag_3_494 | diag_3_495 | diag_3_496 | diag_3_5 | diag_3_500 | diag_3_501 | diag_3_506 | diag_3_507 | diag_3_508 | diag_3_510 | diag_3_511 | diag_3_512 | diag_3_514 | diag_3_515 | diag_3_516 | diag_3_517 | diag_3_518 | diag_3_519 | diag_3_521 | diag_3_522 | diag_3_523 | diag_3_524 | diag_3_525 | diag_3_527 | diag_3_528 | diag_3_529 | diag_3_53 | diag_3_530 | diag_3_531 | diag_3_532 | diag_3_533 | diag_3_534 | diag_3_535 | diag_3_536 | diag_3_537 | diag_3_54 | diag_3_540 | diag_3_542 | diag_3_543 | diag_3_550 | diag_3_552 | diag_3_553 | diag_3_555 | diag_3_556 | diag_3_557 | diag_3_558 | diag_3_560 | diag_3_562 | diag_3_564 | diag_3_565 | diag_3_566 | diag_3_567 | diag_3_568 | diag_3_569 | diag_3_57 | diag_3_570 | diag_3_571 | diag_3_572 | diag_3_573 | diag_3_574 | diag_3_575 | diag_3_576 | diag_3_577 | diag_3_578 | diag_3_579 | diag_3_580 | diag_3_581 | diag_3_582 | diag_3_583 | diag_3_584 | diag_3_585 | diag_3_586 | diag_3_588 | diag_3_590 | diag_3_591 | diag_3_592 | diag_3_593 | diag_3_594 | diag_3_595 | diag_3_596 | diag_3_597 | diag_3_598 | diag_3_599 | diag_3_600 | diag_3_601 | diag_3_602 | diag_3_604 | diag_3_605 | diag_3_607 | diag_3_608 | diag_3_610 | diag_3_611 | diag_3_614 | diag_3_616 | diag_3_617 | diag_3_618 | diag_3_619 | diag_3_620 | diag_3_621 | diag_3_623 | diag_3_624 | diag_3_625 | diag_3_626 | diag_3_627 | diag_3_641 | diag_3_642 | diag_3_644 | diag_3_646 | diag_3_647 | diag_3_648 | diag_3_649 | diag_3_652 | diag_3_653 | diag_3_654 | diag_3_655 | 
diag_3_656 | diag_3_657 | diag_3_658 | diag_3_659 | diag_3_660 | diag_3_661 | diag_3_663 | diag_3_664 | diag_3_665 | diag_3_669 | diag_3_670 | diag_3_674 | diag_3_680 | diag_3_681 | diag_3_682 | diag_3_685 | diag_3_686 | diag_3_690 | diag_3_692 | diag_3_693 | diag_3_694 | diag_3_695 | diag_3_696 | diag_3_697 | diag_3_698 | diag_3_70 | diag_3_701 | diag_3_702 | diag_3_703 | diag_3_704 | diag_3_705 | diag_3_706 | diag_3_707 | diag_3_708 | diag_3_709 | diag_3_710 | diag_3_711 | diag_3_712 | diag_3_713 | diag_3_714 | diag_3_715 | diag_3_716 | diag_3_717 | diag_3_718 | diag_3_719 | diag_3_720 | diag_3_721 | diag_3_722 | diag_3_723 | diag_3_724 | diag_3_725 | diag_3_726 | diag_3_727 | diag_3_728 | diag_3_729 | diag_3_730 | diag_3_731 | diag_3_733 | diag_3_734 | diag_3_735 | diag_3_736 | diag_3_737 | diag_3_738 | diag_3_741 | diag_3_742 | diag_3_744 | diag_3_745 | diag_3_746 | diag_3_747 | diag_3_750 | diag_3_751 | diag_3_752 | diag_3_753 | diag_3_754 | diag_3_755 | diag_3_756 | diag_3_757 | diag_3_758 | diag_3_759 | diag_3_78 | diag_3_780 | diag_3_781 | diag_3_782 | diag_3_783 | diag_3_784 | diag_3_785 | diag_3_786 | diag_3_787 | diag_3_788 | diag_3_789 | diag_3_79 | diag_3_790 | diag_3_791 | diag_3_792 | diag_3_793 | diag_3_794 | diag_3_795 | diag_3_796 | diag_3_797 | diag_3_799 | diag_3_8 | diag_3_800 | diag_3_801 | diag_3_802 | diag_3_805 | diag_3_807 | diag_3_808 | diag_3_810 | diag_3_812 | diag_3_813 | diag_3_814 | diag_3_815 | diag_3_816 | diag_3_820 | diag_3_821 | diag_3_822 | diag_3_823 | diag_3_824 | diag_3_825 | diag_3_826 | diag_3_831 | diag_3_834 | diag_3_836 | diag_3_837 | diag_3_838 | diag_3_840 | diag_3_841 | diag_3_842 | diag_3_844 | diag_3_845 | diag_3_847 | diag_3_850 | diag_3_851 | diag_3_852 | diag_3_853 | diag_3_854 | diag_3_860 | diag_3_861 | diag_3_862 | diag_3_864 | diag_3_865 | diag_3_866 | diag_3_867 | diag_3_868 | diag_3_871 | diag_3_872 | diag_3_873 | diag_3_875 | diag_3_876 | diag_3_877 | diag_3_880 | diag_3_881 | diag_3_882 | diag_3_883 | 
diag_3_890 | diag_3_891 | diag_3_892 | diag_3_9 | diag_3_905 | diag_3_906 | diag_3_907 | diag_3_908 | diag_3_909 | diag_3_910 | diag_3_911 | diag_3_912 | diag_3_913 | diag_3_916 | diag_3_917 | diag_3_918 | diag_3_919 | diag_3_920 | diag_3_921 | diag_3_922 | diag_3_923 | diag_3_924 | diag_3_928 | diag_3_930 | diag_3_933 | diag_3_934 | diag_3_94 | diag_3_943 | diag_3_945 | diag_3_948 | diag_3_951 | diag_3_952 | diag_3_953 | diag_3_955 | diag_3_956 | diag_3_958 | diag_3_959 | diag_3_962 | diag_3_965 | diag_3_966 | diag_3_967 | diag_3_969 | diag_3_970 | diag_3_971 | diag_3_972 | diag_3_987 | diag_3_989 | diag_3_991 | diag_3_995 | diag_3_996 | diag_3_997 | diag_3_998 | diag_3_999 | diag_3_ | diag_3_E812 | diag_3_E813 | diag_3_E815 | diag_3_E816 | diag_3_E817 | diag_3_E818 | diag_3_E819 | diag_3_E822 | diag_3_E825 | diag_3_E826 | diag_3_E828 | diag_3_E849 | diag_3_E850 | diag_3_E852 | diag_3_E853 | diag_3_E858 | diag_3_E861 | diag_3_E870 | diag_3_E878 | diag_3_E879 | diag_3_E880 | diag_3_E881 | diag_3_E882 | diag_3_E883 | diag_3_E884 | diag_3_E885 | diag_3_E886 | diag_3_E887 | diag_3_E888 | diag_3_E892 | diag_3_E894 | diag_3_E900 | diag_3_E905 | diag_3_E906 | diag_3_E912 | diag_3_E916 | diag_3_E917 | diag_3_E919 | diag_3_E920 | diag_3_E922 | diag_3_E924 | diag_3_E927 | diag_3_E928 | diag_3_E929 | diag_3_E930 | diag_3_E931 | diag_3_E932 | diag_3_E933 | diag_3_E934 | diag_3_E935 | diag_3_E936 | diag_3_E937 | diag_3_E938 | diag_3_E939 | diag_3_E941 | diag_3_E942 | diag_3_E943 | diag_3_E944 | diag_3_E946 | diag_3_E947 | diag_3_E949 | diag_3_E950 | diag_3_E956 | diag_3_E966 | diag_3_E980 | diag_3_V02 | diag_3_V03 | diag_3_V06 | diag_3_V08 | diag_3_V09 | diag_3_V10 | diag_3_V11 | diag_3_V12 | diag_3_V13 | diag_3_V14 | diag_3_V15 | diag_3_V16 | diag_3_V17 | diag_3_V18 | diag_3_V23 | diag_3_V25 | diag_3_V27 | diag_3_V42 | diag_3_V43 | diag_3_V44 | diag_3_V45 | diag_3_V46 | diag_3_V49 | diag_3_V53 | diag_3_V54 | diag_3_V55 | diag_3_V57 | diag_3_V58 | diag_3_V60 | diag_3_V61 | 
diag_3_V62 | diag_3_V63 | diag_3_V64 | diag_3_V65 | diag_3_V66 | diag_3_V70 | diag_3_V72 | diag_3_V85 | diag_3_V86 | max_glu_serum_300 | max_glu_serum_None | max_glu_serum_Norm | A1Cresult_8 | A1Cresult_None | A1Cresult_Norm | metformin_No | metformin_Steady | metformin_Up | repaglinide_No | repaglinide_Steady | repaglinide_Up | nateglinide_Steady | nateglinide_Up | chlorpropamide_Steady | glimepiride_No | glimepiride_Steady | glimepiride_Up | glipizide_No | glipizide_Steady | glipizide_Up | glyburide_No | glyburide_Steady | glyburide_Up | tolbutamide_Steady | pioglitazone_No | pioglitazone_Steady | pioglitazone_Up | rosiglitazone_No | rosiglitazone_Steady | rosiglitazone_Up | acarbose_Steady | acarbose_Up | miglitol_Steady | tolazamide_Steady | insulin_No | insulin_Steady | insulin_Up | glyburidemetformin_Steady | change_No | diabetesMed_Yes | group_diag_1_Diabetes | group_diag_1_Digestive | group_diag_1_Genitourinary | group_diag_1_Injury | group_diag_1_Musculoskeletal | group_diag_1_Neoplasms | group_diag_1_Other | group_diag_1_Respiratory | group_diag_1_na | group_diag_2_Diabetes | group_diag_2_Digestive | group_diag_2_Genitourinary | group_diag_2_Injury | group_diag_2_Musculoskeletal | group_diag_2_Neoplasms | group_diag_2_Other | group_diag_2_Respiratory | group_diag_2_na | group_diag_3_Diabetes | group_diag_3_Digestive | group_diag_3_Genitourinary | group_diag_3_Injury | group_diag_3_Musculoskeletal | group_diag_3_Neoplasms | group_diag_3_Other | group_diag_3_Respiratory | group_diag_3_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
|
| 0 | 8 | 77 | 6 | 33 | 0 | 0 | 0 | 8 | 1 | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | 
False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | True | False | True | False | False | False | False | False | True | False | False | True | False | False | False | False | False | False | True | False | False | True | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False |
# Generate the samples once again
# Split dataset into 70% training, 15% validation and 15% test, and divide into feature matrix X and label y
X = df.drop(['TARGET'], axis=1)
y = df.TARGET
X_train, X_valtest, y_train, y_valtest = train_test_split(X, y, train_size=0.70, random_state=RANDOM_SEED)
X_val, X_test, y_val, y_test = train_test_split(X_valtest, y_valtest, test_size=0.50, random_state=RANDOM_SEED)
print('Sample sizes for training, validation and test: ', X_train.shape, X_val.shape, X_test.shape)
Sample sizes for training, validation and test: (33425, 2248) (7163, 2248) (7163, 2248)
Scaling is vital in machine learning for consistency, improved performance, and meeting algorithm requirements:
- Uniformity: Diverse features can have varying scales, such as age ranging from 0 to 100 and income spanning thousands to millions. This can mislead algorithms into overemphasizing features with larger ranges.
- Efficiency: Algorithms such as Gradient Descent and K-Nearest-Neighbors (KNN) benefit from features on a similar scale, resulting in better performance and quicker convergence. For example, KNN is heavily influenced by features with larger magnitudes.
- Normalization Requirement: Certain machine learning algorithms, such as neural networks, require inputs to be scaled for model training to effectively proceed.
Typical scaling techniques include Min-Max Scaling, which adjusts features to a 0 to 1 range, and Standard Scaling, which sets features to zero mean and unit variance. The appropriate scaling approach is contingent upon the algorithm utilized and data characteristics. We will employ Standard Scaling in our analysis.
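As a brief illustration of the two techniques named above, the following sketch applies both scalers to a small toy matrix. The values and the names X_toy, X_minmax and X_std are hypothetical and not part of the notebook data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (hypothetical values): column 0 = age, column 1 = income
X_toy = np.array([[25.0, 30_000.0],
                  [40.0, 60_000.0],
                  [55.0, 90_000.0],
                  [70.0, 120_000.0]])

# Min-Max Scaling maps each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Standard Scaling gives each column zero mean and unit variance
X_std = StandardScaler().fit_transform(X_toy)

print("Min-Max column ranges:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("Standard-scaled means:", X_std.mean(axis=0).round(6))
print("Standard-scaled std devs:", X_std.std(axis=0).round(6))
```

After either transformation, both columns live on a comparable scale even though the raw income values are several orders of magnitude larger than the ages.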
# Generate list: feature names
feature_names=list(X.columns)
# Feature scaling using StandardScaler based on training data distributions
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns = feature_names)
X_val = pd.DataFrame(scaler.transform(X_val), columns = feature_names)
We note that standard scaling was also applied to the one-hot encoded categorical features, which is uncommon in practice, as these features are already binary (i.e., values in {0, 1}). However, due to the presence of an intercept in the model, this transformation has no effect on the model's ability to learn and thus does not impact performance.
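The intercept argument can be checked on synthetic data. The sketch below is an assumption-laden illustration rather than part of the notebook: it uses an unregularized least-squares model with an intercept, hypothetical variables x_bin and x_num, and compares the fitted values before and after standardizing the binary column. For regularized models or gradient-based training the equivalence need not hold exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x_bin = rng.integers(0, 2, size=n).astype(float)  # one-hot style column in {0, 1}
x_num = rng.normal(size=n)                        # numerical column
y_toy = 1.5 * x_bin - 0.7 * x_num + rng.normal(scale=0.1, size=n)

def fitted_values(x1, x2, target):
    # Least-squares fit with an explicit intercept column
    A = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

# Standardize the binary column (an affine transformation)
x_bin_scaled = (x_bin - x_bin.mean()) / x_bin.std()

pred_raw = fitted_values(x_bin, x_num, y_toy)
pred_scaled = fitted_values(x_bin_scaled, x_num, y_toy)
print("Identical fitted values:", np.allclose(pred_raw, pred_scaled))
```

The coefficients differ between the two fits, but the intercept absorbs the shift, so the fitted values coincide.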
6.2 Do encoding and scaling affect the benchmark model?
# Repeat CB with one-hot-encoded and scaled data
start_time = time.time()
CB4 = CatBoostClassifier(eval_metric='AUC', random_seed=RANDOM_SEED)
CB4.fit(X_train, y_train, logging_level='Silent')
# Calculate and print the running time in seconds
elapsed_time_CB4 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_CB4:.2f}")
Elapsed time (sec): 22.69
# CatBoost: Plot the top 20 most important features
pd.Series(CB4.feature_importances_, index=X_train.columns).nlargest(20).plot(kind='barh', color=COLOR_DARK).invert_yaxis()
plt.show()
Compared to the corresponding feature importance plot in Subsection 1.5, we now observe that all of the top five features are numerical. However, this does not imply that the previously identified top categorical features are unimportant; rather, their importance is now distributed across multiple columns due to one-hot encoding. As a result, their individual contributions may appear diminished in the feature importance ranking, which can complicate interpretation and feature selection in models that rely on such metrics.
To better assess the overall importance of a categorical variable, one can aggregate the importance scores of its one-hot encoded columns or use alternative encoding methods (e.g., embeddings in neural networks which we will study later) that retain categorical structure and allow more direct interpretation.
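The aggregation idea can be sketched as follows. The importance values, the column names and the helper original_variable are hypothetical; the sketch assumes that one-hot columns follow the naming pattern variable_level, as produced for example by pd.get_dummies.

```python
import pandas as pd

# Hypothetical importance scores of one-hot encoded columns (illustrative values only)
importances = pd.Series({
    "num_lab_procedures": 9.0,
    "race_Caucasian": 2.0,
    "race_Other": 1.5,
    "diag_1_250": 4.0,
    "diag_1_428": 3.5,
    "diag_1_786": 3.0,
})

# Assumed names of the original categorical variables
categorical_vars = ["race", "diag_1"]

def original_variable(column):
    # Map a one-hot column to its source variable via the longest matching prefix;
    # columns without a match (numerical features) keep their own name
    matches = [v for v in categorical_vars if column.startswith(v + "_")]
    return max(matches, key=len) if matches else column

# Sum the importance of all columns that belong to the same original variable
aggregated = importances.groupby(original_variable).sum().sort_values(ascending=False)
print(aggregated)
```

In this toy example the categorical variable diag_1 ranks first after aggregation, although none of its individual one-hot columns outranks the numerical feature.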
# Calculate and display the model AUC
auc_CB4 = roc_auc_score(y_val, CB4.predict_proba(X_val)[:,1])
print('The validation AUC of the CatBoost model after encoding and scaling is: {:.6f}'.format(auc_CB4))
print('The validation AUC of the CatBoost model before encoding and scaling was: {:.6f}'.format(auc_CB1))
The validation AUC of the CatBoost model after encoding and scaling is: 0.702311
The validation AUC of the CatBoost model before encoding and scaling was: 0.707713
print(f"Elapsed time of the CatBoost model after encoding and scaling (sec): {elapsed_time_CB4:.2f}")
print(f"Elapsed time of the CatBoost model before encoding and scaling (sec): {elapsed_time_CB1:.2f}")
Elapsed time of the CatBoost model after encoding and scaling (sec): 22.69
Elapsed time of the CatBoost model before encoding and scaling (sec): 157.66
Summary and interpretation: Although the encoding of the categorical data increased the number of features by a factor of 45 (50 before vs. 2248 after encoding), the training time of CatBoost decreased substantially (from 157.66 s to 22.69 s), while the validation AUC decreased only slightly (from 0.7077 to 0.7023). Apparently, the direct use of the categorical features within CatBoost greatly increases the computational effort but yields only a minimal improvement in prediction quality.
6.3 Subsampling the training data
Fortunately, readmission is a rare event, so the dataset is highly imbalanced with respect to our TARGET variable. Currently, each readmission is offset by an average of 6.6 non-readmissions in the training data. We now investigate whether this imbalance can be reduced to, for example, 1:2 without degrading the resulting model quality, and whether the reduced data volume accelerates model training.
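The relationship between the ratio of events to non-events and the event rate is simple arithmetic. The sketch below uses two hypothetical helper functions, event_rate and non_events_per_event, to convert between the two views for the figures used in this section (13.1% before and 1:2 after subsampling).

```python
def event_rate(non_event_factor):
    # One event per `non_event_factor` non-events
    return 1.0 / (1.0 + non_event_factor)

def non_events_per_event(rate):
    # Inverse view: number of non-events per event for a given event rate
    return (1.0 - rate) / rate

print(f"Event rate at a 1:2 ratio: {event_rate(2):.3f}")                       # 0.333
print(f"Non-events per event at 13.1%: {non_events_per_event(0.131):.1f}")     # 6.6
```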
non_event_factor = 2 # thus the event rate should be 33.3% (instead of 13.1%)
# Save complete training data
X_train_all = X_train.copy(deep=True)
y_train_all = y_train.copy(deep=True)
# Subsampling: increase the event rate to 1:non_event_factor
Xs = pd.concat([X_train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)
Xs = pd.concat([Xs[Xs['TARGET']==1], Xs[Xs['TARGET']==0].sample(
n = non_event_factor * len(Xs[Xs['TARGET']==1] ), random_state=RANDOM_SEED)],axis=0).sort_index(ascending=True)
# Over-write complete training data with subsample
X_train = Xs.drop(columns=["TARGET"])
y_train = Xs["TARGET"]
print("\nSize of training data set after subsampling: ", X_train.shape)
print("\nTARGET-distribution after subsampling:\n", y_train.value_counts())
print("\nReduction of data volume (%): ", round((len(y_train)/len(y_train_all)-1)*100))
Size of training data set after subsampling:  (13176, 2248)

TARGET-distribution after subsampling:
 TARGET
0    8784
1    4392
Name: count, dtype: int64

Reduction of data volume (%):  -61
6.4 Does subsampling affect the benchmark model?
# Repeat CB with subsampled data
start_time = time.time()
CB5 = CatBoostClassifier(eval_metric='AUC', random_seed=RANDOM_SEED)
CB5.fit(X_train, y_train, logging_level='Silent')
# Calculate and print the running time in seconds
elapsed_time_CB5 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_CB5:.2f}")
Elapsed time (sec): 21.06
# Calculate and display the model AUC
auc_CB5 = roc_auc_score(y_val, CB5.predict_proba(X_val)[:,1])
print('The validation AUC of the CatBoost model with subsampled data is: {:.6f}'.format(auc_CB5))
The validation AUC of the CatBoost model with subsampled data is: 0.699061
# Calculate relative performance
print("Reduction of training time (%): ", round((elapsed_time_CB5/elapsed_time_CB4-1)*100))
print("Improvement in AUC score (%): ", round((auc_CB5/auc_CB4-1)*100, 2))
Reduction of training time (%):  -7
Improvement in AUC score (%):  -0.46
Summary and conclusion: Using a 1:2 ratio (i.e., non_event_factor of 2) reduces CatBoost training time by about 7% (21.06 s vs. 22.69 s), with minimal impact on model performance (validation AUC 0.699 vs. 0.702). Thus, subsampling the training data can be a useful strategy for speeding up model training, particularly during hyperparameter tuning, where many models are fitted.
One might argue that this strong undersampling of the majority class discards valuable information. However, the performance observed on the validation data suggests that this is not a concern in our case.
Moving forward, we focus on this subsampled dataset, which reflects an increased event rate of 33.3%.
6.5 Model evaluation and comparison
Prepare model comparison:
# Data structures for model names, AUC, as well as training time
mname = []
mauc = []
mtrainingtime = []
dict = {'Model name': mname,'AUC': mauc, 'Training time': mtrainingtime}
# Store name, AUC and training time of previously fitted models
mname.append("CB1_quick")
mauc.append(auc_CB1)
mtrainingtime.append(elapsed_time_CB1)
mname.append("LR1_select")
mauc.append(auc_LR1)
mtrainingtime.append(elapsed_time_LR1)
mname.append("CB4_OHE")
mauc.append(auc_CB4)
mtrainingtime.append(elapsed_time_CB4)
mname.append("CB5_after subsampling")
mauc.append(auc_CB5)
mtrainingtime.append(elapsed_time_CB5)
def plot_model_performance(d, x1, x2, t):
    df_eval = pd.DataFrame(d)
    # Create a figure with 1 row and 2 columns
    plt.figure(figsize=(20, 6))
    # First subplot (left): Model AUCs
    ax1 = plt.subplot(1, 2, 1)
    ax1.set_title(" Model evaluation (AUC): " + t)
    sns.barplot(data = df_eval, x = "AUC", y = "Model name", color = COLOR_DARK)
    ax1.set_xlim(x1, x2)
    # Second subplot (right): Model training times
    ax2 = plt.subplot(1, 2, 2)
    ax2.set_title(" Model training time in seconds: " + t)
    sns.barplot(data = df_eval, x = "Training time", y = "Model name", color = COLOR_DARK)
    ax2.set_ylabel("")
    plt.tight_layout()
    plt.show()
Compare Model Performance (AUC - higher is better):
plot_model_performance(dict, 0.66, 0.72, "Validation")
Summary and conclusion: Among the models evaluated, the benchmark model (CB1_quick) delivers the highest validation AUC, although it incurs a significantly longer training time because it processes the categorical features directly. Notably, one-hot encoding and subsampling reduce CatBoost training time by more than 80% (from 157.66 s to 22.69 s and 21.06 s, respectively), at the cost of only a small decrease in AUC (0.708 vs. 0.702 and 0.699). Logistic regression, while slightly less accurate, achieves the fastest training time, illustrating a practical trade-off between model complexity and computational efficiency.
Task PT3: Adjusting and Extending Model Optimization [Learning objectives: 5.1 — Points: 3]] ¶
- a) Sections 7 and 8 are not to be processed and must be removed.
- b) Sections 9 and 10 must be processed and adjusted. Configure the settings appropriately based on the
hardware used (CPU/GPU). What modifications should be made to the hyperparameter grids (HP grids)? Interpret the results.
- c) Section 11 is to be skipped for now.
Solution:
Part C is all about harnessing the full potential of our models by:
- Fine-tuning hyperparameters for the gradient tree boosting model champions CatBoost, LightGBM, and XGBoost;
- Representing categorical features using embeddings in a neural network and fine-tuning these embeddings;
- Re-training the previous baseline models with the newly embedded features; and finally,
- Exploring AutoML, including model ensembles.
Following this, we provide a detailed comparative evaluation using validation and test datasets and explore the application of these models in high-risk domains. To wrap up, we summarize the key insights gained from our machine learning exploration.
As mentioned earlier, the CPU is sufficient for our task. Additionally, we will adjust the settings and hyperparameter grids accordingly, opting for a more efficient but less aggressive hyperparameter search if needed.
Now we want to investigate whether we can optimize the learning rate and tree depth of CatBoost by performing a costly cross-validated randomized search (fitting 40 models).
In the first code block, we are setting the stage for hyperparameter optimization of the CatBoostClassifier model. We define a parameter grid specifying the range of values to be explored for the learning rate and the depth of the trees. Here, loguniform is used for the learning rate to sample from a logarithmic distribution, optimizing the search over different scales. Similarly, a list of integers represents possible values for the depth of the trees.
# Define the hyperparameter grid to search
param_grid_CB ={'learning_rate': loguniform(0.01, 0.05),
'depth': [4,5,6,7]}
# Define variables for results reports
sel_params_CB = ['param_learning_rate','param_depth','mean_test_score','rank_test_score']
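To make the effect of loguniform in the grid above more tangible, the following sketch draws samples from the same range and compares the empirical share of small learning rates with the theoretical value. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import loguniform

# Same range as the learning-rate entry in the CatBoost grid
dist = loguniform(0.01, 0.05)
samples = dist.rvs(size=10_000, random_state=0)

# All draws stay inside the requested interval
print("Within range:", samples.min() >= 0.01 and samples.max() <= 0.05)

# Share of draws below 0.03, the midpoint of the range on a linear scale
empirical_share = (samples < 0.03).mean()

# Theoretical share: log(0.03/0.01) / log(0.05/0.01), roughly 0.68
theoretical_share = np.log(0.03 / 0.01) / np.log(0.05 / 0.01)
print(f"Empirical share below 0.03: {empirical_share:.3f}")
print(f"Theoretical share below 0.03: {theoretical_share:.3f}")
```

Because the distribution is uniform on the logarithmic scale, about two thirds of the sampled learning rates fall into the lower half of the interval, so small values are explored more densely than with a plain uniform distribution.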
Following the initial setup, our next code block implements the actual hyperparameter optimization using RandomizedSearchCV, a strategy that randomly samples the parameter space and performs four-fold cross-validation. We choose a reasonable number of iterations (n_iter=10) to balance between computation time and thoroughness of the search. The results are captured, including the run time and the best parameters found.
# CatBoostClassifier: Hyper parameter optimization with RandomizedSearchCV
tic = time.time()
CB_rs = RandomizedSearchCV(CatBoostClassifier(iterations=1000, eval_metric='AUC'), param_grid_CB,
cv=4, n_iter=10, n_jobs=-1, random_state=RANDOM_SEED, scoring='roc_auc')
CB_rs.fit(X_train, y_train, logging_level='Silent')
# Print the runtime as well as the five best hyper parameter constellations
print("time (sec):" + "%6.0f" % (time.time() - tic))
print("Best hyper parameters:",CB_rs.best_params_)
pd.DataFrame(CB_rs.cv_results_)[sel_params_CB].sort_values("rank_test_score").head()
time (sec): 240
Best hyper parameters: {'depth': 6, 'learning_rate': 0.020941182421473574}
| | param_learning_rate | param_depth | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 4 | 0.020941 | 6 | 0.692364 | 1 |
| 1 | 0.032482 | 6 | 0.692006 | 2 |
| 5 | 0.026312 | 4 | 0.691834 | 3 |
| 0 | 0.036038 | 6 | 0.691131 | 4 |
| 2 | 0.026132 | 4 | 0.690959 | 5 |
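The mechanics of the randomized search can be reproduced in miniature. The sketch below is not the notebook's CatBoost setup: as an assumption for illustration, it uses synthetic data from make_classification and scikit-learn's GradientBoostingClassifier as a stand-in model, with five candidates and four folds.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic classification problem (illustrative only)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# A continuous distribution for the learning rate and a discrete list for the depth
param_grid = {"learning_rate": loguniform(0.01, 0.05), "max_depth": [2, 3]}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid, cv=4, n_iter=5, scoring="roc_auc", random_state=0)
search.fit(X_toy, y_toy)

# Number of sampled candidates and of cross-validation fits (candidates x folds)
n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_
print("Candidates:", n_candidates, "CV fits:", n_fits)
print("Best parameters:", search.best_params_)
```

With n_iter=10 and cv=4, as in the CatBoost search, the same arithmetic gives the 40 cross-validation fits mentioned earlier, plus one final refit on the full training data.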
After identifying the best hyperparameters, we use them to fit a new CatBoostClassifier model with the entire training set. This approach leverages the optimized parameters that should theoretically yield a more accurate model.
# CatBoostClassifier: Fit new model on all folds (with best parameters)
start_time = time.time()
CB6 = CatBoostClassifier(**CB_rs.best_params_, iterations=1000, eval_metric='AUC')
CB6.fit(X_train, y_train, logging_level='Silent')
# Calculate and print the running time in seconds
elapsed_time_CB6 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_CB6:.2f}")
Elapsed time (sec): 18.48
Finally, we evaluate the performance of the tuned model by calculating the AUC score using the validation set.
# Calculate and display the model AUC
auc_CB6 = roc_auc_score(y_val, CB6.predict_proba(X_val)[:,1])
print('The validation AUC of the CatBoost model with tuned hyperparameters is: {:.6f}'.format(auc_CB6))
print('Comparison: The val. AUC of the CatBoost model with subsampled data is: {:.6f}'.format(auc_CB5))
The validation AUC of the CatBoost model with tuned hyperparameters is: 0.700497
Comparison: The val. AUC of the CatBoost model with subsampled data is: 0.699061
Although the hyperparameter search is far more expensive than a single fit (240 s for the search versus roughly 20 s for one model), the tuned model performs slightly better on the validation data.
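Since all comparisons in this part rest on the AUC, a small worked example may help with reading differences in the third decimal place. The AUC can be interpreted as the probability that a randomly chosen event receives a higher score than a randomly chosen non-event. The helper pairwise_auc below is a hypothetical pure-Python sketch of that definition on toy data; it is not the implementation behind roc_auc_score.

```python
def pairwise_auc(y_true, scores):
    # Split the scores by class
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Count event/non-event pairs that are ranked correctly; ties count as one half
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical)
labels_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

# Three of the four event/non-event pairs are ranked correctly
print(pairwise_auc(labels_toy, scores_toy))  # 0.75
```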
In this section, we turn our attention to CatBoost's primary competitors: LightGBM and XGBoost. Since our experiments are conducted on CPUs, we omit the device="cuda" setting. Given LightGBM's comparatively faster training time, we follow the original recommendation to increase model exploration by doubling the number of evaluated configurations. Furthermore, we adjust the hyperparameter grids accordingly to enhance the search for optimal model performance.
10.1 LightGBM: HP-tuning and evaluation
# Parameter search range for LightGBM
param_grid_LGB = {'learning_rate': loguniform(0.001, 0.02),
'num_leaves': [25,32,40,50],
'subsample': uniform(0,1),
'colsample_bytree': uniform(0,1),
'verbose': [-1]}
# Create a list of variables for displaying the cross-validation result
sel_params_LGB = ['param_learning_rate', 'param_num_leaves', 'param_subsample',
'param_colsample_bytree', 'mean_test_score', 'rank_test_score']
# LGBMClassifier: Hyperparameter optimization with RandomizedSearchCV
tic = time.time()
LGB_rs = RandomizedSearchCV(LGBMClassifier(n_estimators=1000),
param_grid_LGB, cv=4, n_iter=20, scoring='roc_auc',
n_jobs=-1, random_state=RANDOM_SEED)
LGB_rs.fit(X_train, y_train)
# Output the runtime and best parameters (as well as the runner-ups)
print("Elapsed time (sec):" + "%6.0f" % (time.time() - tic))
print("Best Parameters:", LGB_rs.best_params_)
pd.DataFrame(LGB_rs.cv_results_)[sel_params_LGB].sort_values("rank_test_score").head()
Elapsed time (sec): 360
Best Parameters: {'colsample_bytree': 0.24102546602601171, 'learning_rate': 0.007743661011598839, 'num_leaves': 50, 'subsample': 0.4951769101112702, 'verbose': -1}
| | param_learning_rate | param_num_leaves | param_subsample | param_colsample_bytree | mean_test_score | rank_test_score |
|---|---|---|---|---|---|---|
| 11 | 0.007744 | 50 | 0.495177 | 0.241025 | 0.690353 | 1 |
| 1 | 0.003802 | 40 | 0.058084 | 0.596850 | 0.690204 | 2 |
| 18 | 0.002320 | 50 | 0.275999 | 0.356753 | 0.688115 | 3 |
| 13 | 0.003574 | 32 | 0.546710 | 0.755361 | 0.687634 | 4 |
| 2 | 0.006054 | 50 | 0.650888 | 0.866176 | 0.687624 | 5 |
# LGBMClassifier: Create a new model on all folds using the best parameters
start_time = time.time()
LGB = LGBMClassifier(**LGB_rs.best_params_, n_estimators=1000)
LGB.fit(X_train, y_train)
# Calculate and print the running time in seconds
elapsed_time_LGB = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_LGB:.2f}")
Elapsed time (sec): 6.51
# Calculate and display the model AUC
auc_LGB = roc_auc_score(y_val, LGB.predict_proba(X_val)[:,1])
print('The validation AUC of the LightGBM model with tuned hyperparameters is: {:.6f}'.format(auc_LGB))
The validation AUC of the LightGBM model with tuned hyperparameters is: 0.705074
10.2 XGBoost: HP-tuning and evaluation
We remove device="cuda" to utilize CPUs and set tree_method='hist' for faster CPU-based histogram training. We retain the same number of iterations (n_iter=10) as used in tuning CatBoost to strike a balance between computational efficiency and the thoroughness of the search. Additionally, we again adjust the hyperparameter grids accordingly to enhance the search for optimal model performance.
# Parameter search range for XGBoost
param_grid_XGB = {'learning_rate': loguniform(0.005, 0.03),
'max_depth': [6,7,8,9],
'subsample': uniform(0,1),
'colsample_bytree': uniform(0,1)}
# Create a list of variables for displaying the cross-validation result
sel_params_XGB = ['param_learning_rate', 'param_max_depth', 'param_subsample',
'param_colsample_bytree', 'mean_test_score', 'rank_test_score']
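To illustrate the search ranges above, the following sketch draws values from the same kinds of distributions. It assumes that `loguniform` and `uniform` are the `scipy.stats` distributions (the import is not shown in this section); the variable names are illustrative only.

```python
from scipy.stats import loguniform, uniform

# Assumed to mirror the ranges above; the names are illustrative
lr_dist = loguniform(0.005, 0.03)   # log-uniform on [0.005, 0.03]
frac_dist = uniform(0, 1)           # uniform on [loc, loc + scale] = [0, 1]

lr_samples = lr_dist.rvs(size=1000, random_state=42)
frac_samples = frac_dist.rvs(size=1000, random_state=42)

# Half of the log-uniform draws fall below the geometric mean of the bounds
lr_median = lr_dist.median()

print(f"learning_rate range: [{lr_samples.min():.4f}, {lr_samples.max():.4f}]")
print(f"learning_rate median: {lr_median:.5f}")
print(f"fraction range: [{frac_samples.min():.4f}, {frac_samples.max():.4f}]")
```

Because `uniform(0, 1)` can return values close to 0, a draw may leave very few rows or columns per tree. Narrowing the range, for example with `uniform(0.2, 0.8)` for the interval [0.2, 1.0], is one option worth considering.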
# XGBClassifier: Hyperparameter optimization with RandomizedSearchCV
tic = time.time()
XGB_rs = RandomizedSearchCV(XGBClassifier(n_estimators=1000, tree_method="hist"),
param_grid_XGB, cv=4, n_iter=10, scoring="roc_auc",
n_jobs=-1, random_state=RANDOM_SEED)
XGB_rs.fit(X_train, y_train)
# Output the runtime and best parameters (as well as the runner-ups)
print("Elapsed time (sec):" + "%6.0f" % (time.time() - tic))
print("Best Parameters:", XGB_rs.best_params_)
pd.DataFrame(XGB_rs.cv_results_)[sel_params_XGB].sort_values("rank_test_score").head()
Elapsed time (sec): 587
Best Parameters: {'colsample_bytree': 0.18182496720710062, 'learning_rate': 0.006945227129221507, 'max_depth': 9, 'subsample': 0.6116531604882809}
| | param_learning_rate | param_max_depth | param_subsample | param_colsample_bytree | mean_test_score | rank_test_score |
|---|---|---|---|---|---|---|
| 4 | 0.006945 | 9 | 0.611653 | 0.181825 | 0.689892 | 1 |
| 8 | 0.005434 | 8 | 0.680308 | 0.592415 | 0.688936 | 2 |
| 7 | 0.015141 | 9 | 0.514234 | 0.090606 | 0.688423 | 3 |
| 0 | 0.027464 | 8 | 0.779691 | 0.374540 | 0.688003 | 4 |
| 2 | 0.014680 | 9 | 0.650888 | 0.866176 | 0.687881 | 5 |
# XGBClassifier: Create a new model on all folds using the best parameters
start_time = time.time()
XGB = XGBClassifier(**XGB_rs.best_params_, n_estimators=1000, tree_method="hist", eval_metric="auc")
XGB.fit(X_train, y_train)
# Calculate and print the running time in seconds
elapsed_time_XGB = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_XGB:.2f}")
Elapsed time (sec): 68.60
# Calculate and display the model AUC
auc_XGB = roc_auc_score(y_val, XGB.predict_proba(X_val)[:,1])
print('The validation AUC of the XGBoost model with tuned hyperparameters is: {:.6f}'.format(auc_XGB))
The validation AUC of the XGBoost model with tuned hyperparameters is: 0.703716
We will compare the performance of all models created including LightGBM and XGBoost at the end of the notebook.
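As a reminder of what the validation AUC reported above measures, the following minimal sketch computes it on hypothetical toy scores in two ways: as the share of (positive, negative) pairs that the model ranks correctly, and with `roc_auc_score`. The data and variable names are illustrative and not taken from the notebook.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# All (positive score, negative score) pairs
pairs = [(p, n)
         for p, yp in zip(y_score, y_true) if yp == 1
         for n, yn in zip(y_score, y_true) if yn == 0]

# A pair counts 1 if the positive is ranked higher, 0.5 for a tie, 0 otherwise
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
sk_auc = roc_auc_score(y_true, y_score)

print(pairwise_auc, sk_auc)
```

Both values agree, which is why the AUC depends only on the ranking of the predicted probabilities and not on their calibration.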
Task PT4: Creating DXG Diagnosis Groups and Preparing for Embedding Generation
Task PT4-1: [Learning objectives: 5.1 — Points: 5]
As a preparatory step for Sections PT4 and PT5, categorical features need to be adjusted to facilitate the creation of embeddings.
Note: This section of the task goes beyond the scope of the CSN notebook.
The following steps must be completed in the given order:
a) Load the dataset icd9_data.csv generated in Task R1 of Part I and display the first three rows.
b) Merge the original dataset (from Task PT1, Section 1.2) with the dataset from step a). Retain only the columns group, L2code and category from the dataset in step a). Then, display three rows of the new dataset.
c) Based on the dataset from PT4-1 b): Which categories have the highest absolute proportions in relation to readmissions? Output the top three categories.
d) The feature category contains values that, in some cases, appear infrequently. To generate stable embeddings for this feature as well, proceed as follows:
- Add a new column sum_target_cg to the dataset from step b), which should be filled using the results from part c). This way, the values of category should be assigned their respective absolute proportions.
- Finally, add the new column hdiag: If the category of a row appears more than 20 times in the dataset, the value of category should be retained. Otherwise, the value should be a combination of the string "o." and the value of the feature group.
- Trim the contents of the new column to a maximum of 30 characters. Verify the results by checking two sample rows.
Solution:
Task PT4-1 a)
# Load the dataset `icd9_data.csv` into a pandas DataFrame
icd9 = pd.read_csv("../02_tidy_data/icd9_data.csv", # UPDATE this path if needed
na_filter=False) # no `NA` values in our prepared dataset
# Display the first three rows from the dataset to inspect data
icd9.head(n=3)
| | category_id | code | group | chapter | chapter_id | L2code | category |
|---|---|---|---|---|---|---|---|
| 0 | 0 | ? | n.a. | n.a | 0 | n.a. | Not available |
| 1 | 135 | 1 | Digestive | Diseases of the digestive system | 9 | c9.1 | Intestinal infection |
| 2 | 1 | 10 | Other | Infectious and parasitic diseases | 1 | c1.1 | Tuberculosis |
Task PT4-1 b)
We merge the ICD-9 lookup table onto each of the three diagnosis codes (diag_1, diag_2, and diag_3) and suffix every retained column with the diagnosis it belongs to, for example category_diag_1, category_diag_2, and category_diag_3.
# Extract the joining variables of the original dataset
diag_cols = ['diag_1', 'diag_2', 'diag_3']
# Columns to retain from the icd9 data
cols_to_retain = ['group', 'L2code', 'category']
# Define a helper function to merge each diagnosis `diag_col` with the respective retained column `icd9_col`
def merge_diag_column(df, icd9_col, diag_col):
    return (
        df.merge(icd9[['code', icd9_col]],
                 left_on=diag_col, right_on='code', how='left')
        .rename(columns={icd9_col: f'{icd9_col}_{diag_col}'})
        .drop(columns='code')
    )
# Copy the original dataset and drop the column `group` for consistency
df_pre2 = df_raw.copy().drop(columns=[f'group_{diag}' for diag in diag_cols])
# Merge the datasets
for icd9_col in cols_to_retain:
    for diag_col in diag_cols:
        df_pre2 = merge_diag_column(df_pre2, icd9_col, diag_col)
# Replace missing values with the same labels as in the dataset `icd9`
for i in range(1, 4):
    df_pre2[f'group_diag_{i}'] = df_pre2[f'group_diag_{i}'].fillna('n.a.')
    df_pre2[f'L2code_diag_{i}'] = df_pre2[f'L2code_diag_{i}'].fillna('n.a.')
    df_pre2[f'category_diag_{i}'] = df_pre2[f'category_diag_{i}'].fillna('Not available')
# Display the first three rows of the merged dataset
df_pre2.head(n=3)
| | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | TARGET | group_diag_1 | group_diag_2 | group_diag_3 | L2code_diag_1 | L2code_diag_2 | L2code_diag_3 | category_diag_1 | category_diag_2 | category_diag_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [50-60) | ? | 2 | 1 | 1 | 8 | ? | Cardiology | 77 | 6 | 33 | 0 | 0 | 0 | 401 | 997 | 560 | 8 | None | None | Steady | No | No | No | No | No | No | Down | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | 1 | Circulatory | Injury | Digestive | c7.1 | c16.10 | c9.6 | Essential hypertension | Complications of surgical procedures or medica... | Intestinal obstruction without hernia |
| 1 | Caucasian | Female | [50-60) | ? | 3 | 1 | 1 | 2 | ? | Surgery-Neuro | 49 | 1 | 11 | 0 | 0 | 0 | 722 | 305 | 250 | 3 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 0 | Musculoskeletal | Other | Diabetes | c13.3 | c5.11 | c3.2 | Spondylosis; intervertebral disc disorders; ot... | Alcohol-related disorders | Diabetes mellitus without complication |
| 2 | Caucasian | Female | [80-90) | ? | 1 | 3 | 7 | 4 | MC | InternalMedicine | 68 | 2 | 23 | 0 | 0 | 0 | 820 | 493 | E880 | 9 | None | >7 | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | 0 | Injury | Respiratory | Other | c16.2 | c8.3 | cNA | Fracture of neck of femur (hip) | Asthma | Supplementary Classification Of External Cause... |
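The merge logic above can be checked on a small example. The sketch below uses hypothetical toy frames (`patients`, `lookup`) in place of `df_raw` and `icd9`, and additionally passes `validate="many_to_one"`, which is an assumption on our side: it makes pandas raise an error if a code appears more than once in the lookup table, so the left merge cannot silently duplicate patient rows.

```python
import pandas as pd

# Hypothetical toy stand-ins for the patient data and the ICD-9 lookup
patients = pd.DataFrame({"diag_1": ["250", "401", "999"],
                         "diag_2": ["401", "250", "250"]})
lookup = pd.DataFrame({"code": ["250", "401"],
                       "category": ["Diabetes", "Hypertension"]})

merged = patients.copy()
for diag_col in ["diag_1", "diag_2"]:
    merged = (
        merged.merge(lookup, left_on=diag_col, right_on="code",
                     how="left", validate="many_to_one")  # fail on duplicate codes
        .rename(columns={"category": f"category_{diag_col}"})
        .drop(columns="code")
    )
    # Codes missing from the lookup become NaN and are filled explicitly
    merged[f"category_{diag_col}"] = merged[f"category_{diag_col}"].fillna("Not available")

print(merged)
```

The row count stays equal to that of `patients`, and the unknown code `999` ends up in the explicit "Not available" category.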
Task PT4-1 c)
To identify the top three CCS categories (based on the merged dataset) with the highest absolute proportions of readmissions, we stack the three diagnosis categories recorded for each readmitted patient as follows.
# Filter for readmitted patients
readmitted_df = df_pre2[df_pre2['TARGET'] == 1]
# Stack category_diag_1, category_diag_2, category_diag_3 into a single column
category_cols = [f'category_{diag}' for diag in diag_cols]
readmitted_stacked = readmitted_df[category_cols].melt(value_name='category')
# Count the frequency of each category across all diagnosis levels
target_category_counts = readmitted_stacked['category'].value_counts()
# Sort in descending order
target_category_counts = target_category_counts.sort_values(ascending=False)
# Display the top 3 most frequent categories
print("The top three categories across all three diagnosis levels:")
display(target_category_counts.reset_index().head(3).style.hide(axis='index'))
The top three categories across all three diagnosis levels:
| category | count |
|---|---|
| Diabetes mellitus without complication | 1874 |
| Congestive heart failure; nonhypertensive | 1186 |
| Coronary atherosclerosis and other heart disease | 1057 |
Alternatively, we count them separately by diagnosis level, namely primary (diag_1), secondary (diag_2), and additional secondary (diag_3).
# Filter for readmitted patients
readmitted_df = df_pre2[df_pre2['TARGET'] == 1]
# Initialize dictionary to hold counts
target_category_counts = {}
# Loop over each diagnosis column
for diag in diag_cols:
    # Count value frequencies for this column
    counts = readmitted_df[f'category_{diag}'].value_counts()
    # Sort in descending order and store in dictionary
    target_category_counts[diag] = counts.sort_values(ascending=False)
    # Display the top three most frequent categories
    print(f"The top three categories of {diag}:")
    display(target_category_counts[diag].reset_index()
            .rename(columns={f'category_{diag}': 'category'}).head(3).style.hide(axis='index'))
The top three categories of diag_1:
| category | count |
|---|---|
| Congestive heart failure; nonhypertensive | 443 |
| Coronary atherosclerosis and other heart disease | 429 |
| Acute cerebrovascular disease | 324 |
The top three categories of diag_2:
| category | count |
|---|---|
| Diabetes mellitus without complication | 591 |
| Congestive heart failure; nonhypertensive | 438 |
| Fluid and electrolyte disorders | 404 |
The top three categories of diag_3:
| category | count |
|---|---|
| Diabetes mellitus without complication | 961 |
| Essential hypertension | 474 |
| Fluid and electrolyte disorders | 317 |
We retain this per-diagnosis counting approach to ensure stable embeddings for each diagnosis level.
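The difference between the two counting approaches can be seen on a small example. The sketch below uses a hypothetical toy frame with two diagnosis levels; it is meant only to show that the stacked counts are the sum of the per-level counts.

```python
import pandas as pd

# Hypothetical toy data with two diagnosis levels
toy = pd.DataFrame({
    "TARGET": [1, 1, 0, 1],
    "category_diag_1": ["A", "B", "A", "A"],
    "category_diag_2": ["B", "B", "A", "C"],
})
cols = ["category_diag_1", "category_diag_2"]
readmitted = toy[toy["TARGET"] == 1]

# Approach 1: stack all levels into one column, then count
stacked = readmitted[cols].melt(value_name="category")["category"].value_counts()

# Approach 2: count each diagnosis level separately
per_level = {c: readmitted[c].value_counts() for c in cols}

# The stacked counts equal the sum of the per-level counts
summed = per_level[cols[0]].add(per_level[cols[1]], fill_value=0).astype(int)

print(stacked.sort_index().to_dict())
print(summed.sort_index().to_dict())
```

A category can therefore rank high in the stacked view while being rare at one particular level, which is relevant when a separate embedding is learned per level.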
Task PT4-1 d)
Next, for each category_diag_x, we add a new column sum_target_cg_diag_x using the result from part c). If a category is not present among readmitted patients, its value is set to zero.
# Add the required new column `sum_target_cg` for each diagnosis
for col in diag_cols:
    # Dictionary with the readmission count of each category for this diagnosis level
    sum_target_cg_dict = target_category_counts[col].to_dict()
    df_pre2[f'sum_target_cg_{col}'] = df_pre2[f'category_{col}'].map(sum_target_cg_dict)
    # Fill it with the value zero if the category is not found among readmitted patients
    df_pre2[f'sum_target_cg_{col}'] = df_pre2[f'sum_target_cg_{col}'].fillna(0).astype('int64')
We now add a new column hdiag for each diag_x, assigning values based on the frequency of the corresponding diagnosis category in the dataset filtered for readmitted patients and a predefined threshold of 20. The resulting values are then trimmed to a maximum length of 30 characters.
threshold = 20
# Function to add a new column adjusting the value of each `category`
def create_hdiag(diag, category, group):
    # Set the count again to zero if the category is not found among readmitted patients
    count = target_category_counts[diag].get(category, 0)
    # Set the value of `hdiag` to category if the category appears more than 20 times
    # and the string desired otherwise
    value = category
    if count <= threshold:
        value = f"o.{group}"
    # Trim to max 30 characters
    return str(value)[:30]
# Create a new column `hdiag_x` for each diagnosis `diag_x`
for diag in diag_cols:
    df_pre2[f'h{diag}'] = df_pre2.apply(
        lambda row: create_hdiag(diag, row[f'category_{diag}'], row[f'group_{diag}']),
        axis=1)
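The row-wise `apply` above is easy to read but can be slow on larger datasets. As a hedged alternative, the same rule can be expressed with vectorized operations. The sketch below uses a hypothetical toy frame and a toy threshold of 2 instead of 20; the counts are taken from the toy frame itself as a stand-in for `target_category_counts[diag]`.

```python
import numpy as np
import pandas as pd

TOY_THRESHOLD = 2  # toy value; the notebook uses 20

# Hypothetical toy data for a single diagnosis level
toy = pd.DataFrame({
    "category": ["Diabetes mellitus without complication", "Rare condition",
                 "Diabetes mellitus without complication",
                 "Diabetes mellitus without complication"],
    "group": ["Diabetes", "Other", "Diabetes", "Diabetes"],
})

# Stand-in for the per-level readmission counts
counts = toy["category"].value_counts()
freq = toy["category"].map(counts).fillna(0)

# Keep frequent categories, otherwise fall back to "o." + group, then trim
hdiag = pd.Series(
    np.where(freq > TOY_THRESHOLD, toy["category"], "o." + toy["group"]),
    index=toy.index,
).str[:30]

print(hdiag.tolist())
```

The result should match the row-wise version as long as the same counts and threshold are used.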
Finally, we verify the correctness of the hdiag transformation logic for a few random rows by simulating a unit-test-like validation using assert statements.
# Function to verify the content of the column `hdiag` for a few random rows
def test_hdiag_logic(num_checks=3):
    passed = 0
    failed = 0
    failed_rows = []
    sample_rows = df_pre2.sample(n=num_checks, random_state=RANDOM_SEED)
    for idx, row in sample_rows.iterrows():
        for diag in diag_cols:
            category = row[f'category_{diag}']
            group = row[f'group_{diag}']
            hdiag = row[f'h{diag}']
            # Get value counts of the corresponding category
            count = target_category_counts[diag].get(category, 0)
            # Expected value of `hdiag`
            expected = category if count > threshold else f"o.{group}"
            expected = str(expected)[:30]
            try:
                assert hdiag == expected
                passed += 1
            except AssertionError:
                failed += 1
                failed_rows.append({
                    "index": idx,
                    "diag": diag,
                    "category": category,
                    "group": group,
                    "count": count,
                    "hdiag": hdiag,
                    "expected": expected
                })
    print(f"✅ {passed} checks passed")
    if failed:
        print(f"❌ {failed} checks failed. Here are the details:")
        for row in failed_rows:
            print(row)
# Verify the columns `hdiag_x` by checking two sample rows
test_hdiag_logic(num_checks=2)
✅ 6 checks passed
The test passed for the two sample rows across all three diagnosis categories.
Task PT4-2: [Learning objectives: 2.2, 5.1 — Points: 10]
a) After adjusting the feature category in the previous task, further relevant categorical features are to be adjusted. Proceed as follows: If the number of feature values is smaller than the threshold of 20, then replace
- the value of the feature discharge_disposition_id with 99,
- the value of the feature medical_specialty with ZZ,
- the value of the feature payer_code with ZZ,
- the value of the feature admission_type with 99,
- the value of the feature age with the range [0,20)
After making these replacements, display the updated feature values along with their frequency.
b) Create suitable visualizations for the features modified in (a) and PT4-1. Identify and analyze at least one notable pattern in greater detail.
Solution:
Task PT4-2 a)
# Define the columns to be adjusted and their replacement values
categorical_features_to_adjust = {
'discharge_disposition_id': 99,
'medical_specialty': 'ZZ',
'payer_code': 'ZZ',
'admission_type_id': 99, # Assuming 'admission_type' is 'admission_type_id'
'age': '[0-20)'
}
# Apply replacements based on frequency
for col, replacement in categorical_features_to_adjust.items():
    # Count frequency of each value
    value_counts = df_pre2[col].value_counts()
    # Identify values below threshold
    rare_values = value_counts[value_counts < threshold].index
    # Replace rare values
    df_pre2[col] = df_pre2[col].apply(lambda x: replacement if x in rare_values else x)
# Display updated values and frequencies
for col in categorical_features_to_adjust.keys():
    print(f"\nUpdated value counts for '{col}':")
    display(df_pre2[col].value_counts().reset_index().style.hide(axis='index'))
Updated value counts for 'discharge_disposition_id':
| discharge_disposition_id | count |
|---|---|
| 1 | 30258 |
| 3 | 6029 |
| 6 | 5189 |
| 18 | 1921 |
| 2 | 1101 |
| 22 | 1073 |
| 5 | 664 |
| 25 | 494 |
| 4 | 375 |
| 7 | 296 |
| 23 | 181 |
| 28 | 72 |
| 8 | 49 |
| 15 | 28 |
| 24 | 21 |
Updated value counts for 'medical_specialty':
| medical_specialty | count |
|---|---|
| ? | 22642 |
| InternalMedicine | 7341 |
| Family/GeneralPractice | 3312 |
| Emergency/Trauma | 2875 |
| Cardiology | 2811 |
| Surgery-General | 1528 |
| Orthopedics | 891 |
| Orthopedics-Reconstructive | 806 |
| Radiologist | 576 |
| ObstetricsandGynecology | 506 |
| Nephrology | 460 |
| Psychiatry | 450 |
| Pulmonology | 415 |
| Urology | 409 |
| Surgery-Cardiovascular/Thoracic | 402 |
| Surgery-Neuro | 340 |
| Gastroenterology | 246 |
| Surgery-Vascular | 241 |
| PhysicalMedicineandRehabilitation | 153 |
| Oncology | 144 |
| Pediatrics | 139 |
| Neurology | 135 |
| Pediatrics-Endocrinology | 131 |
| ZZ | 104 |
| Otolaryngology | 90 |
| Hematology/Oncology | 77 |
| Endocrinology | 68 |
| Surgery-Cardiovascular | 67 |
| Surgery-Thoracic | 66 |
| Pediatrics-CriticalCare | 52 |
| Gynecology | 46 |
| Psychology | 39 |
| Surgeon | 35 |
| Podiatry | 34 |
| Radiology | 29 |
| Ophthalmology | 25 |
| Surgery-Plastic | 23 |
| Hospitalist | 23 |
| InfectiousDiseases | 20 |
Updated value counts for 'payer_code':
| payer_code | count |
|---|---|
| ? | 20378 |
| MC | 13115 |
| HM | 2821 |
| BC | 2601 |
| SP | 2259 |
| MD | 1522 |
| CP | 1391 |
| UN | 1385 |
| CM | 913 |
| OG | 475 |
| PO | 357 |
| DM | 256 |
| WC | 96 |
| CH | 92 |
| OT | 42 |
| SI | 26 |
| MP | 22 |
Updated value counts for 'admission_type_id':
| admission_type_id | count |
|---|---|
| 1 | 24036 |
| 3 | 10056 |
| 2 | 8731 |
| 6 | 2641 |
| 5 | 2047 |
| 8 | 222 |
| 99 | 18 |
Updated value counts for 'age':
| age | count |
|---|---|
| [70-80) | 11623 |
| [60-70) | 10736 |
| [50-60) | 8615 |
| [80-90) | 7242 |
| [40-50) | 4885 |
| [30-40) | 2003 |
| [90-100) | 1282 |
| [20-30) | 844 |
| [10-20) | 392 |
| [0-10) | 129 |
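The replacement rule used in this task can also be written without a row-wise `apply`. The following sketch shows the idea with `Series.isin` and `Series.mask` on a hypothetical toy column and a toy threshold of 3 instead of 20.

```python
import pandas as pd

TOY_THRESHOLD = 3  # toy value; the notebook uses 20

# Hypothetical payer codes: "SI" is rare
payer = pd.Series(["MC"] * 5 + ["HM"] * 3 + ["SI"])

counts = payer.value_counts()
rare_values = counts[counts < TOY_THRESHOLD].index

# Replace every rare value with the consolidation label in one step
payer_adjusted = payer.mask(payer.isin(rare_values), "ZZ")

print(payer_adjusted.value_counts().to_dict())
```

Only values strictly below the threshold are consolidated, so a value that occurs exactly as often as the threshold is kept.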
Task PT4-2 b)
We will use bar plots to display relative frequencies and highlight distribution patterns, particularly among consolidated values such as 99, ZZ, as well as o.{group} in hdiag.
# Function to display relative frequencies for categorical features
def plot_relative_frequencies(features, fig_width, fig_height):
    for feature in features:
        plt.figure(figsize=(fig_width, fig_height))
        feature_counts = df_pre2[feature].value_counts(normalize=True) * 100
        sns.barplot(
            x=feature_counts.index.astype(str), y=feature_counts.values,
            color=COLOR_DARK
        )
        plt.title(f"Distribution of {feature} as Percentage of Total")
        plt.ylabel("Percentage (%)")
        plt.yscale('log')  # Set y-axis to log base 10
        plt.xlabel(feature)
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()
# Plot each categorical feature as a bar chart showing relative frequencies
plot_relative_frequencies(categorical_features_to_adjust.keys(), 10, 5)
plot_relative_frequencies([f'h{diag}' for diag in diag_cols], 18, 6)
After replacing rare values, the distributions of discharge_disposition_id, payer_code, and age are unchanged, because none of their values occurs fewer than 20 times. For admission_type_id, only 18 rows are reassigned to the consolidated value 99.
In contrast, medical_specialty as well as hdiag_1, hdiag_2, and hdiag_3 undergo a significant transformation: the distributions become noticeably more uniform, as many low-frequency values are consolidated into broader grouped categories (e.g., ZZ or o.{group}). Some notable patterns are:
- Dominance of consolidation labels: In the hdiag_* features, for example, the artificially introduced label o.Other becomes one of the top categories. This suggests that a substantial portion of the data originally consisted of highly fragmented or rare categories, making them unsuitable for robust statistical modeling without grouping.
- Smoothing of category distribution: The hdiag_* features, for example, show a flattening of the frequency curve after grouping infrequent diagnostic categories. In other words, a more balanced representation emerges. This helps reduce model bias toward overly common diagnosis categories while still preserving signal in more general diagnostic groups.
Task PT5: Creating Stable Diagnosis Embeddings with a High-Performance Neural Network
Task PT5-1: [Learning objectives: 5.1 — Points: 6]
a) Perform label encoding on the categorical variables generated in Task PT4-1 and Task PT4-2. Display three random rows from the new dataset.
b) Scale the numerical features number_inpatient, num_lab_procedures, number_diagnoses and num_medications. Demonstrate the effect of the scaling.
c) To train a neural network, create a dataset consisting of the categorical features from PT4-2 a) and the numerical features from part b) of this task. Split this dataset into 70% training data, 15% validation data, and 15% test data.
Solution:
PT5-1 a)
We perform label encoding using basic Python by manually mapping each unique category to a corresponding integer. Alternatively, one can use LabelEncoder from sklearn.preprocessing.
# List of categorical columns to encode
categorical_features = list(categorical_features_to_adjust.keys()) + [f'h{diag}' for diag in diag_cols]
# Store the dataset before encoding as a backup
df_before_embeddings = df_pre2.copy()
# Perform label encoding manually (alternatively with LabelEncoder from sklearn)
for col in categorical_features:
    unique_vals = df_pre2[col].astype(str).unique()
    val_to_int = {val: idx for idx, val in enumerate(sorted(unique_vals))}  # `sorted(unique_vals)` ensures consistent ordering.
    df_pre2[col] = df_pre2[col].astype(str).map(val_to_int)
# Display three random rows
df_pre2.sample(n=3, random_state=RANDOM_SEED)
| | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | TARGET | group_diag_1 | group_diag_2 | group_diag_3 | L2code_diag_1 | L2code_diag_2 | L2code_diag_3 | category_diag_1 | category_diag_2 | category_diag_3 | sum_target_cg_diag_1 | sum_target_cg_diag_2 | sum_target_cg_diag_3 | hdiag_1 | hdiag_2 | hdiag_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47523 | Asian | Female | 5 | ? | 2 | 12 | 1 | 1 | 0 | 16 | 18 | 4 | 6 | 0 | 0 | 0 | 722 | 250 | 401 | 4 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 0 | Musculoskeletal | Diabetes | Circulatory | c13.3 | c3.2 | c7.1 | Spondylosis; intervertebral disc disorders; ot... | Diabetes mellitus without complication | Essential hypertension | 17 | 961 | 474 | 58 | 21 | 21 |
| 8582 | Caucasian | Male | 4 | ? | 1 | 12 | 4 | 8 | 0 | 0 | 45 | 6 | 53 | 0 | 0 | 0 | 410 | 414 | 401 | 7 | None | None | No | No | No | No | No | No | No | Up | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Ch | Yes | 0 | Circulatory | Circulatory | Circulatory | c7.2 | c7.2 | c7.1 | Acute myocardial infarction | Coronary atherosclerosis and other heart disease | Essential hypertension | 10 | 261 | 474 | 3 | 18 | 21 |
| 42024 | Caucasian | Female | 7 | ? | 0 | 9 | 7 | 3 | 7 | 0 | 47 | 0 | 13 | 0 | 0 | 0 | 25012 | 427 | 276 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | 0 | Diabetes | Circulatory | Other | c3.3 | c7.2 | c3.8 | Diabetes mellitus with complications | Cardiac dysrhythmias | Fluid and electrolyte disorders | 193 | 260 | 317 | 20 | 10 | 22 |
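For comparison, the `LabelEncoder` alternative mentioned above can be sketched as follows. The toy column is hypothetical; the point is that `LabelEncoder` also assigns integers in sorted order of the string values, so both approaches should give the same codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with age brackets as strings
col = pd.Series(["[50-60)", "[0-10)", "[50-60)", "[10-20)"]).astype(str)

# Manual mapping, as in the notebook: sorted unique values -> 0, 1, 2, ...
manual_map = {val: idx for idx, val in enumerate(sorted(col.unique()))}
manual_codes = col.map(manual_map)

# Equivalent encoding with scikit-learn
encoder = LabelEncoder()
sk_codes = encoder.fit_transform(col)

print(manual_codes.tolist(), sk_codes.tolist())
print(encoder.inverse_transform([0, 1, 2]).tolist())
```

An advantage of keeping the fitted `encoder` (or the `manual_map` dictionary) is that the integer codes can later be translated back into the original labels, for example when plotting embeddings.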
PT5-1 b)
To scale the numerical features and demonstrate the effect, we use standardization (z-score normalization), which transforms each feature so that it has:
- Mean = 0
- Standard deviation = 1
We'll then compare the original vs. scaled distributions numerically as well as visually.
# Numerical columns to scale
numerical_features = ['number_inpatient', 'num_lab_procedures', 'number_diagnoses', 'num_medications']
# Compute means and stds before scaling (for comparison)
before_scaling = df_pre2[numerical_features].describe()
# Manual standardization
for col in numerical_features:
    mean = df_pre2[col].mean()
    std = df_pre2[col].std()
    df_pre2[col] = (df_pre2[col] - mean) / std
# Compute means and stds after scaling
after_scaling = df_pre2[numerical_features].describe()
# Show comparison side-by-side
print("Before Scaling:\n", before_scaling)
print("\nAfter Scaling:\n", after_scaling)
Before Scaling:
number_inpatient num_lab_procedures number_diagnoses num_medications
count 47751.000000 47751.000000 47751.000000 47751.000000
mean 0.138259 42.356746 7.102092 15.559716
std 0.532646 19.809284 2.046605 8.516024
min 0.000000 1.000000 1.000000 1.000000
25% 0.000000 30.000000 5.000000 10.000000
50% 0.000000 44.000000 8.000000 14.000000
75% 0.000000 56.000000 9.000000 20.000000
max 12.000000 132.000000 16.000000 81.000000
After Scaling:
number_inpatient num_lab_procedures number_diagnoses num_medications
count 4.775100e+04 4.775100e+04 4.775100e+04 4.775100e+04
mean -2.142744e-17 4.999735e-17 -1.761811e-16 -4.419409e-17
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -2.595701e-01 -2.087746e+00 -2.981568e+00 -1.709685e+00
25% -2.595701e-01 -6.237856e-01 -1.027112e+00 -6.528535e-01
50% -2.595701e-01 8.295371e-02 4.387304e-01 -1.831507e-01
75% -2.595701e-01 6.887303e-01 9.273444e-01 5.214034e-01
max 2.226948e+01 4.525315e+00 4.347643e+00 7.684370e+00
# Plot the distribution of numerical features before and after scaling
for col in numerical_features:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    sns.histplot(df_before_embeddings[col], ax=axes[0], kde=True, color=COLOR_LIGHT)
    axes[0].set_title(f"{col} - Before Scaling")
    sns.histplot(df_pre2[col], ax=axes[1], kde=True, color=COLOR_DARK)
    axes[1].set_title(f"{col} - After Scaling")
    plt.tight_layout()
    plt.show()
Note that the scaling doesn't change the structure of the distribution, but the range and magnitude of the feature values are transformed. Each numerical feature, namely number_inpatient, num_lab_procedures, number_diagnoses, and num_medications, is rescaled to have a mean of 0 and a standard deviation of 1. As mentioned in Section 6.1, this transformation ensures that all features contribute equally to distance-based algorithms (like k-NN or SVM) and improves numerical stability in gradient-based models (e.g., logistic regression, neural networks).
Importantly, the relative spacing between values and the shape of the distribution (e.g., skewness, modality) are preserved. Outliers remain outliers and are now expressed in units of standard deviations: the maximum of number_inpatient, for instance, lies about 22 standard deviations above the mean.
Visual inspection before and after scaling confirms this shift in scale while maintaining overall distributional patterns.
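One detail of the manual standardization is worth noting: `Series.std()` in pandas uses the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on hypothetical toy values, shows that the two variants differ only by the constant factor sqrt(n / (n - 1)), which is negligible for a dataset of this size.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature with one large outlier
x = pd.Series([0.0, 0.0, 0.0, 1.0, 12.0])
n = len(x)

# Manual z-score with the pandas default (sample standard deviation, ddof=1)
z_sample = (x - x.mean()) / x.std()

# Same transformation with the population standard deviation (ddof=0)
z_population = (x - x.mean()) / x.std(ddof=0)

print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
print(np.sqrt(n / (n - 1)))  # constant ratio between the two variants
```

In both cases the ordering and relative spacing of the values are unchanged; only the unit of measurement differs.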
PT5-1 c)
We now combine the processed categorical and numerical features into a new dataset, df2, which will be used to train a neural network with embeddings for the categorical features. The dataset is then split into 70% training, 15% validation, and 15% test data as desired.
# Create the dataset consisting of the features processed above together with the TARGET
df2 = df_pre2[categorical_features + numerical_features + ['TARGET']]
# Split the dataset into training, validation and test data
# Constants for data split ratios
TRAIN_RATIO = 0.7
VAL_TEST_RATIO = 0.5 # Splitting the remaining 30% equally into validation and test
# Separate the features (X) and the target (y)
X2 = df2.drop(columns=['TARGET'])
y2 = df2['TARGET']
# Split the dataset into a training set and a combined validation and test set
X2_train, X2_val_test, y2_train, y2_val_test = train_test_split(X2, y2, train_size=TRAIN_RATIO, random_state=RANDOM_SEED)
# Further split the combined validation and test set into separate validation and test sets
X2_val, X2_test, y2_val, y2_test = train_test_split(X2_val_test, y2_val_test, test_size=VAL_TEST_RATIO, random_state=RANDOM_SEED)
# Verify the dimensions of the training, validation, and test sets by displaying (rows, columns)
print(f"Training set dimensions (rows, columns): {X2_train.shape}")
print(f"Validation set dimensions (rows, columns): {X2_val.shape}")
print(f"Test set dimensions (rows, columns): {X2_test.shape}")
Training set dimensions (rows, columns): (33425, 12)
Validation set dimensions (rows, columns): (7163, 12)
Test set dimensions (rows, columns): (7163, 12)
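The two-stage split can be reproduced on a small synthetic example. The sketch below additionally passes `stratify`, which is not used in the split above and is shown here only as an optional variation: with a rare target, stratification keeps the share of positives nearly identical across the three subsets. All data and names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 10% positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

# Stage 1: 70% training, 30% held out
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42)

# Stage 2: split the held-out 30% equally into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(X_tr), len(X_va), len(X_te))
print(y_tr.mean(), y_va.mean(), y_te.mean())
```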
Task PT5-2: [Learning objectives: 5.1 — Points: 20]
a) Build a neural network to generate embeddings for the categorical features. The following should be considered:
- The input to the network is the dataset from PT5-1 c).
- An embedding with two dimensions should be generated for each categorical feature.
- The number of hidden layers can be chosen freely. The final choice must be justified.
b) Fit the network defined in part a). Plot the loss and validation loss over the training epochs. Comment on the progression. Then, calculate the validation loss.
c) To assess the stability of the calculated embeddings in the next section, clone the network from a) and fit it again. Calculate and output the validation loss.
d) Extract the embeddings from both networks and plot them side by side in pairs. What anomalies or patterns can be observed? What can be said about the stability of the embeddings?
Solution:
To accomplish this task, we use a complete pipeline built with TensorFlow/Keras, constructing a neural network that generates 2D embeddings for the categorical features and trains on our classification task of predicting readmission.
PT5-2 a)
To determine the optimal architecture for the hidden layers, we tune the artificial neural network by building several feed-forward models with one to three hidden layers, each containing between five and thirty neurons. Additionally, a dropout layer with varying dropout rates is applied after each dense layer to help reduce overfitting. These models are trained using different batch sizes and numbers of epochs, with the Adam optimizer facilitating the optimization process.
We first define a function that builds and compiles the neural network architecture based on hyperparameters:
def build_model(layers, dropout_rate, optimizer):
    # Define input layers and embedding layers
    inputs = {}
    embeddings = {}
    embedding_dim = 2  # 2D embeddings
    for col in categorical_features:
        # Number of unique feature values
        num_unique = df2[col].nunique()
        # Input layer
        inputs[col] = Input(shape=(1,), name=f"{col}_input", dtype='int64')
        # Embedding layer
        embeddings[col] = Embedding(input_dim=num_unique, output_dim=embedding_dim, name=f"{col}_embedding")(inputs[col])
        embeddings[col] = Flatten(name=f"{col}_flatten")(embeddings[col])
    # Add input for numerical features
    inputs['numerical'] = Input(shape=(len(numerical_features),), name="numerical_input")
    # Concatenate all features: categorical embeddings + numerical input
    all_features = list(embeddings.values()) + [inputs['numerical']]
    x = Concatenate(name="concat_layer")(all_features)
    if isinstance(layers, int):  # 1 hidden layer
        x = Dense(layers, activation='relu', name='dense')(x)
        x = Dropout(dropout_rate, name='dropout')(x)
    else:
        for i, nodes in enumerate(layers):
            x = Dense(nodes, activation='relu', name=f'dense_{i+1}')(x)
            x = Dropout(dropout_rate, name=f'dropout_{i+1}')(x)
    # Output layer
    output = Dense(1, activation='sigmoid', name='output')(x)
    # Define and compile our neural network model
    ANN = Model(inputs=list(inputs.values()), outputs=output)
    ANN.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['auc'])
    return ANN
We then prepare the data for the model input:
# Categorical features
X2_train_cat = [X2_train[col].values for col in categorical_features]
X2_val_cat = [X2_val[col].values for col in categorical_features]
X2_test_cat = [X2_test[col].values for col in categorical_features]
# Numerical features (assumed already scaled)
X2_train_num = X2_train[numerical_features].values
X2_val_num = X2_val[numerical_features].values
X2_test_num = X2_test[numerical_features].values
# Combine categorical and numerical inputs into one list per data set
X2_train_ANN = X2_train_cat + [X2_train_num]
X2_val_ANN = X2_val_cat + [X2_val_num]
X2_test_ANN = X2_test_cat + [X2_test_num]
We then perform hyperparameter optimization with a manual random search over 20 randomly drawn configurations and holdout validation, analogous to scikit-learn's randomized search but without cross-validation. Note that the set of possible layer structures defined below is designed so that the initial layers capture complex feature interactions, while the subsequent layers, with fewer units, serve to compress the information and provide regularization.
# Hyperparameter search range
param_grid = {
    # An int denotes a single hidden layer, a tuple one entry per hidden layer
    'layers': [10, 20, 30, (10, 5), (20, 5), (30, 5), (20, 10), (30, 10), (30, 20, 5), (30, 20, 10)],
    'dropout_rate': [0.1, 0.3, 0.5],
    'optimizer': ['adam'],
    'batch_size': [32, 64],
    'epochs': [10, 20]
}
# Define EarlyStopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_auc',         # Watch the validation AUC
    mode='max',                # A higher AUC is better
    patience=5,                # Number of epochs with no improvement after which training will be stopped
    restore_best_weights=True  # Restore model weights from the epoch with the best value of the monitored quantity
)
# Artificial neural network: Hyperparameter optimization with a manual random search
tic = time.time()
results = []
# Ensure reproducibility
set_random_seed(RANDOM_SEED)
# Start the hyperparameter tuning
for i in range(20):  # Try 20 random configurations (drawn with replacement)
    params = {
        k: random.choice(v)
        for k, v in param_grid.items()
    }
    # print(f"Trial {i+1}: {params}")
    ANN = build_model(params['layers'], params['dropout_rate'], params['optimizer'])
    # Fit the neural network built for the sampled configuration
    history = ANN.fit(
        X2_train_ANN, y2_train,
        validation_data=(X2_val_ANN, y2_val),
        epochs=params['epochs'],
        callbacks=[early_stopping],
        batch_size=params['batch_size'],
        verbose=0
    )
    # Record the best validation loss and AUC over all epochs of this trial
    val_loss = min(history.history["val_loss"])
    val_auc = max(history.history["val_auc"])
    results.append((params, val_loss, val_auc))
# Sort by best validation auc
results.sort(key=lambda x: x[2], reverse=True)
# Output the runtime and best parameters (as well as the runners-up)
print("Elapsed time (sec):" + "%6.0f" % (time.time() - tic))
print("Best: %f using %s" % (results[0][2], results[0][0]))
result_df = pd.DataFrame(results, columns=['params', 'val_loss', 'val_auc'])
pd.concat([result_df['params'].apply(pd.Series), result_df[['val_loss', 'val_auc']]], axis=1).head(12)
Elapsed time (sec): 372
Best: 0.692090 using {'layers': (30, 10), 'dropout_rate': 0.3, 'optimizer': 'adam', 'batch_size': 32, 'epochs': 20}
| | layers | dropout_rate | optimizer | batch_size | epochs | val_loss | val_auc |
|---|---|---|---|---|---|---|---|
| 0 | (30, 10) | 0.3 | adam | 32 | 20 | 0.359087 | 0.692090 |
| 1 | (30, 10) | 0.1 | adam | 64 | 10 | 0.358597 | 0.689873 |
| 2 | 10 | 0.1 | adam | 32 | 10 | 0.358947 | 0.688083 |
| 3 | (20, 5) | 0.3 | adam | 64 | 20 | 0.361231 | 0.687680 |
| 4 | (30, 20, 5) | 0.1 | adam | 32 | 10 | 0.360035 | 0.685874 |
| 5 | (30, 20, 10) | 0.3 | adam | 64 | 10 | 0.361238 | 0.685116 |
| 6 | 10 | 0.5 | adam | 64 | 10 | 0.361394 | 0.684720 |
| 7 | 10 | 0.1 | adam | 32 | 20 | 0.360505 | 0.684214 |
| 8 | 30 | 0.5 | adam | 64 | 20 | 0.360139 | 0.684154 |
| 9 | 10 | 0.5 | adam | 32 | 20 | 0.361033 | 0.684061 |
| 10 | (20, 10) | 0.1 | adam | 64 | 10 | 0.360023 | 0.683964 |
| 11 | (30, 20, 10) | 0.1 | adam | 64 | 10 | 0.359962 | 0.683775 |
The best result was achieved with an architecture consisting of two hidden layers with 30 and 10 units, respectively, along with dropout layers using a dropout rate of 0.3 (validation AUC 0.692090). However, the simplest architecture among the top results, a single hidden layer with 10 units (dropout rate 0.1, batch size 32, 10 epochs), reaches a validation AUC of 0.688083, only about 0.004 below the best. We therefore select it as the best trade-off between model complexity and validation performance.
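The selection rule applied here (prefer the simplest architecture whose validation AUC is close to the best) can be written down explicitly. The sketch below uses the top six rows of the results table; the tolerance of 0.005 and the use of the total number of hidden units as complexity measure are assumptions made for illustration, not choices documented in the notebook.

```python
# (layers, dropout_rate, batch_size, epochs, val_auc): top rows of the results table
results = [
    ((30, 10), 0.3, 32, 20, 0.692090),
    ((30, 10), 0.1, 64, 10, 0.689873),
    ((10,), 0.1, 32, 10, 0.688083),
    ((20, 5), 0.3, 64, 20, 0.687680),
    ((30, 20, 5), 0.1, 32, 10, 0.685874),
    ((30, 20, 10), 0.3, 64, 10, 0.685116),
]

TOLERANCE = 0.005  # assumed threshold, not taken from the notebook

def total_units(layers):
    """Crude complexity measure (assumption): total number of hidden units."""
    return sum(layers)

best_auc = max(row[4] for row in results)
# Keep every configuration whose AUC is within the tolerance of the best one
candidates = [row for row in results if best_auc - row[4] <= TOLERANCE]
# Among those, pick the smallest network
selected = min(candidates, key=lambda row: total_units(row[0]))

print(selected)  # ((10,), 0.1, 32, 10, 0.688083)
```

With a tighter tolerance (for example 0.003) the rule would keep only the two (30, 10) configurations, so the outcome depends on how much AUC one is willing to give up for simplicity.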
PT5-2 b)
We now fit the network with the hyperparameters selected in part a).
# Artificial neural network: Create a new model using the selected hyperparameters
start_time = time.time()
# Reset the seed after neural network tuning to ensure reproducibility
set_random_seed(RANDOM_SEED)
# Build the neural network model with the selected hyperparameters
ANN = build_model(10, 0.1, 'adam')
# Fit the model (validation on the holdout set of the embedding dataset)
ANN_history = ANN.fit(
    X2_train_ANN, y2_train,
    validation_data=(X2_val_ANN, y2_val),
    epochs=10,
    callbacks=[early_stopping],
    batch_size=32,
    verbose=1
)
# Calculate and print the running time in seconds
elapsed_time_ANN = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_ANN:.2f}")
Epoch 1/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - auc: 0.5288 - loss: 0.5071 - val_auc: 0.6794 - val_loss: 0.3631
Epoch 2/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - auc: 0.6379 - loss: 0.3789 - val_auc: 0.6843 - val_loss: 0.3608
Epoch 3/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - auc: 0.6598 - loss: 0.3717 - val_auc: 0.6831 - val_loss: 0.3598
Epoch 4/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - auc: 0.6680 - loss: 0.3687 - val_auc: 0.6803 - val_loss: 0.3601
Epoch 5/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - auc: 0.6775 - loss: 0.3656 - val_auc: 0.6801 - val_loss: 0.3600
Epoch 6/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - auc: 0.6795 - loss: 0.3651 - val_auc: 0.6798 - val_loss: 0.3603
Epoch 7/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - auc: 0.6832 - loss: 0.3637 - val_auc: 0.6782 - val_loss: 0.3602
Elapsed time (sec): 20.94
# Plot training and validation loss
plt.plot(ANN_history.history['loss'], label='Training Loss', color=COLOR_LIGHT)
plt.plot(ANN_history.history['val_loss'], label='Validation Loss', color=COLOR_DARK)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training vs Validation Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
From the progression plot, we observe the following (epoch numbers as in the training log):
The training loss drops sharply after the first epoch (from 0.5071 to 0.3789), indicating that the model quickly learns patterns in the training data. It then continues to decline slowly, reaching 0.3637 in epoch 7. The validation loss starts much lower (0.3631), improves only marginally to its minimum of 0.3598 in epoch 3, and then stays essentially flat at about 0.360. The training loss remains above the validation loss throughout, which is plausible because dropout is active only during training and the training loss is averaged over all batches of an epoch.
Since the validation metrics stop improving while the training loss keeps falling, additional epochs would mainly fit the training data more closely and risk overfitting. The early-stopping callback monitors the validation AUC, which peaks at 0.6843 in epoch 2. Training was therefore halted after five further epochs without improvement (epoch 7), and the weights from epoch 2 were restored, which also conserves computational resources.
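The early-stopping behaviour seen in the log (patience of five epochs on the validation AUC) can be reproduced with a few lines of plain Python. This is a simplified sketch of the patience logic as we understand it, not the Keras implementation; the function name `early_stopping_epoch` is hypothetical, and the input is the validation AUC per epoch as printed in the training log above.

```python
def early_stopping_epoch(scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a metric to be maximized."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop: no improvement for `patience` epochs
    return len(scores), best_epoch  # all epochs were run without triggering the stop

# Validation AUC per epoch, copied from the training log
val_auc_log = [0.6794, 0.6843, 0.6831, 0.6803, 0.6801, 0.6798, 0.6782]

stop_epoch, best_epoch = early_stopping_epoch(val_auc_log, patience=5)
print(stop_epoch, best_epoch)  # 7 2
```

With `restore_best_weights=True`, the model evaluated afterwards corresponds to the best epoch (here epoch 2), not to the last epoch that was trained.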
We now calculate and output the validation loss and the validation AUC.
# Calculate and display the model validation loss
val_loss = ANN.evaluate(X2_val_ANN, y2_val, verbose=0)[0]
print(f"Final Validation Loss: {val_loss:.4f}")
Final Validation Loss: 0.3608
# Calculate and display the model validation AUC
auc_ANN = ANN.evaluate(X2_val_ANN, y2_val, verbose=0)[1]
print('The validation AUC of the neural network model with tuned hyperparameters is: {:.6f}'.format(auc_ANN))
The validation AUC of the neural network model with tuned hyperparameters is: 0.684344
PT5-2 c)
To assess the stability of the embeddings, we clone the network architecture, which creates a new model with freshly initialized weights, train it in the same way, and compare the validation loss and the learned embeddings of both runs.
# Clone the network from PT5-2 b) (architecture only; the weights are re-initialized)
ANN_clone = tf.keras.models.clone_model(ANN)
ANN_clone.compile(optimizer='adam', loss='binary_crossentropy', metrics=['auc'])  # recompile
# Fit the network once more
ANN_history_clone = ANN_clone.fit(
    X2_train_ANN, y2_train,
    validation_data=(X2_val_ANN, y2_val),
    epochs=10,
    callbacks=[early_stopping],
    batch_size=32,
    verbose=1
)
# Calculate and output the validation loss of the cloned model
val_loss_clone = ANN_clone.evaluate(X2_val_ANN, y2_val, verbose=0)[0]
print(f"Final Validation Loss: {val_loss_clone:.4f}")
Epoch 1/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - auc: 0.5361 - loss: 0.4957 - val_auc: 0.6773 - val_loss: 0.3632
Epoch 2/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - auc: 0.6450 - loss: 0.3755 - val_auc: 0.6817 - val_loss: 0.3614
Epoch 3/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - auc: 0.6644 - loss: 0.3700 - val_auc: 0.6817 - val_loss: 0.3605
Epoch 4/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - auc: 0.6747 - loss: 0.3667 - val_auc: 0.6797 - val_loss: 0.3602
Epoch 5/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - auc: 0.6821 - loss: 0.3644 - val_auc: 0.6785 - val_loss: 0.3604
Epoch 6/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - auc: 0.6868 - loss: 0.3628 - val_auc: 0.6771 - val_loss: 0.3608
Epoch 7/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - auc: 0.6847 - loss: 0.3630 - val_auc: 0.6759 - val_loss: 0.3611
Epoch 8/10
1045/1045 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - auc: 0.6896 - loss: 0.3616 - val_auc: 0.6754 - val_loss: 0.3609
Final Validation Loss: 0.3608
PT5-2 d)
# Function to extract embedding weights
def get_embedding_weights(model, feature):
    layer = model.get_layer(f"{feature}_embedding")
    return layer.get_weights()[0]
# Prepare plot grid
n = len(categorical_features)
fig, axes = plt.subplots(n, 2, figsize=(12, n * 4))
fig.suptitle("Embedding Comparison per Feature", fontsize=16, y=1.02)
# Loop through features and plot
for i, feature in enumerate(categorical_features):
    emb1 = get_embedding_weights(ANN, feature)
    emb2 = get_embedding_weights(ANN_clone, feature)
    # Original model
    axes[i, 0].scatter(emb1[:, 0], emb1[:, 1], c=COLOR_LIGHT)
    axes[i, 0].set_title(f"{feature} - Original")
    axes[i, 0].grid(True)
    # Cloned model
    axes[i, 1].scatter(emb2[:, 0], emb2[:, 1], c=COLOR_DARK)
    axes[i, 1].set_title(f"{feature} - Cloned")
    axes[i, 1].grid(True)
plt.tight_layout()
plt.show()
Key observations and insights
- Preserved global structure: Most features, such as discharge_disposition_id and medical_specialty, show similar spatial arrangements between the original and cloned models. This indicates strong global consistency in how categories with sufficient data support are embedded. The embedding learning process appears to be robust to a different random initialization.
- Shifted or rotated embeddings: Embeddings such as those for hdiag_2 and age appear rotated or flipped in the cloned model. This is expected: the dense layer that follows the embeddings can compensate for a rotation, reflection, or translation of the embedding space, so the absolute coordinates are not uniquely determined. What matters is the relative positioning of the categories, not their absolute coordinates.
- Differences in density and scale: Features such as hdiag_1 and hdiag_3 show noticeable variation in spread and compactness between the original and cloned versions. These local differences suggest sensitivity to model initialization and learning dynamics, possibly amplified by the high cardinality or complexity of the categories.
- Random dispersion in sparse categories: For features with low-frequency categories (e.g., payer_code and medical_specialty), one of the embedding sets (typically the cloned version) exhibits more random scatter. This points to instability in sparsely represented categories, which may not be consistently learned across training runs.
Conclusions on embedding stability
- Moderate stability: High-frequency categories exhibit greater embedding stability, with well-preserved relative structures.
- Sensitivity to initialization: Low-frequency or underrepresented categories are more prone to variability due to random weight initialization and limited training signals.
- Interpretability caution: Embedding coordinates should not be interpreted in isolation; their meaning lies in relative distances and spatial relationships.
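One way to compare two embedding runs without being misled by rotations, reflections, or shifts is to compare the pairwise distances between categories instead of the raw coordinates. The NumPy sketch below illustrates this on synthetic data; the helper names `pairwise_distances` and `structure_similarity` are hypothetical and not part of the notebook. With the notebook's models, the arrays returned by `get_embedding_weights(ANN, feature)` and `get_embedding_weights(ANN_clone, feature)` could be passed in place of the synthetic arrays.

```python
import numpy as np

def pairwise_distances(emb):
    """Euclidean distance between every pair of rows."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def structure_similarity(emb_a, emb_b):
    """Correlation of the two sets of pairwise distances (upper triangle only)."""
    iu = np.triu_indices(emb_a.shape[0], k=1)
    dist_a = pairwise_distances(emb_a)[iu]
    dist_b = pairwise_distances(emb_b)[iu]
    return float(np.corrcoef(dist_a, dist_b)[0, 1])

rng = np.random.default_rng(42)
emb_original = rng.normal(size=(8, 2))  # synthetic 2D embedding of 8 categories

# A rotated, reflected and shifted copy: same relative structure
theta = np.pi / 3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
reflection = np.diag([1.0, -1.0])
emb_transformed = emb_original @ rotation @ reflection + np.array([0.5, -1.0])

# An unrelated embedding for contrast
emb_random = rng.normal(size=(8, 2))

sim_same = structure_similarity(emb_original, emb_transformed)
sim_diff = structure_similarity(emb_original, emb_random)
print(round(sim_same, 6))  # 1.0
print(sim_diff)            # noticeably below 1
```

A value close to 1 indicates that the relative positions of the categories are preserved even if the scatter plots look rotated or mirrored; note that this measure ignores a uniform rescaling of the embedding.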
Task PT6: Joining Embeddings and Using Them in Modeling [Learning objectives: 5.1 — Points: 15]¶
The goal is to investigate how using the embeddings generated in PT5 (instead of the original categorical variables) impacts the modeling results (i.e., the AUC score).
a) Extend the dataset from Task PT5-1 a) by adding the corresponding embedding results. Remove any unnecessary features from the dataset. Then, split the dataset into training (70%), validation (15%), and test (15%) sets.
b) Based on the dataset from a): Perform logistic regression considering the embedding features. Compare the model in Section 2.2 for guidance. What can be concluded in comparison to it?
c) Similar to b): Apply CatBoost considering the embedding features. What can be concluded in comparison to the result from Section 1.5 (baseline model)?
d) Finally, evaluate the impact of the embeddings in the models considered. Use a bar chart that displays the previous models (i.e., especially the first calculated CatBoost model, logistic regression, as well as models from one-hot encoding and subsampling) with the validation results. The evaluation should include statements on the usefulness of embeddings and your personal recommendation.
Solution:
PT6 a)
# Loop through features and extend dataset with its embedding vectors
for feature in categorical_features:
    # Get embedding weights and map each category to its embedding
    emb_weights = get_embedding_weights(ANN, feature)
    embeddings = {label: emb_weights[label] for label in range(emb_weights.shape[0])}
    # Replace the categorical column with its embedding columns
    embedding_df = df2[feature].map(embeddings)
    embedding_df = pd.DataFrame(embedding_df.tolist(),
                                index=df2.index,
                                columns=[f"{feature}_emb_{k}" for k in range(emb_weights.shape[1])])
    # Join embeddings to the original dataset
    df2 = pd.concat([df2, embedding_df], axis=1)
    # Drop the original categorical columns if now embedded
    df2.drop(columns=[feature], inplace=True)
# Split the dataset into training, validation and test data
# Constants for data split ratios
TRAIN_RATIO = 0.7
VAL_TEST_RATIO = 0.5 # Splitting the remaining 30% equally into validation and test
# Separate the features (X) and the target (y)
X2 = df2.drop(columns=['TARGET'], axis=1)
y2 = df2['TARGET']
# Split the dataset into a training set and a combined validation and test set
X2_train, X2_val_test, y2_train, y2_val_test = train_test_split(X2, y2, train_size=TRAIN_RATIO, random_state=RANDOM_SEED)
# Further split the combined validation and test set into separate validation and test sets
X2_val, X2_test, y2_val, y2_test = train_test_split(X2_val_test, y2_val_test, test_size=VAL_TEST_RATIO, random_state=RANDOM_SEED)
# Verify the dimensions of the training, validation, and test sets by displaying (rows, columns)
print(f"Training set dimensions (rows, columns): {X2_train.shape}")
print(f"Validation set dimensions (rows, columns): {X2_val.shape}")
print(f"Test set dimensions (rows, columns): {X2_test.shape}")
Training set dimensions (rows, columns): (33425, 20)
Validation set dimensions (rows, columns): (7163, 20)
Test set dimensions (rows, columns): (7163, 20)
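The printed shapes can be checked with a short calculation. The sketch below assumes, based on the features named in this notebook, four numerical and eight categorical features with two embedding dimensions each, and it relies on our understanding that `train_test_split` rounds a fractional `train_size` down and a fractional `test_size` up. The total number of rows is inferred from the printed shapes.

```python
import math

n_total = 33425 + 7163 + 7163  # rows implied by the printed shapes

# First split: train_size=0.7, the training share is rounded down
n_train = math.floor(0.7 * n_total)
n_rest = n_total - n_train
# Second split: test_size=0.5, the test share is rounded up
n_test = math.ceil(0.5 * n_rest)
n_val = n_rest - n_test

# Columns: numerical features plus two embedding dimensions per categorical feature
n_numerical, n_categorical, embedding_dim = 4, 8, 2  # assumed counts, see lead-in
n_columns = n_numerical + n_categorical * embedding_dim

print(n_train, n_val, n_test, n_columns)  # 33425 7163 7163 20
```

The same count explains the 16 predictors in the logistic regression of PT6 b): four numerical features plus two embedding columns for each of the six categorical features that remain after excluding hdiag_2 and hdiag_3.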
PT6 b)
We now perform logistic regression with the embedding features (excluding those of hdiag_2 and hdiag_3) and compare it to the result with the original categorical features in Section 2.2.
# Combine features and target variable into a single dataframe for formula-based modeling
Xy2_train = X2_train.copy()
Xy2_val = X2_val.copy()
Xy2_train['TARGET'] = y2_train
Xy2_val['TARGET'] = y2_val
# Start timer to calculate the running time of training the logistic regression model
start_time = time.time()
# Define the logistic regression model using statsmodels' formula API and fit it to the training data
predictors = "+".join([col for col in X2_train.columns if not re.search(r"hdiag_2|hdiag_3", col)])
LR2 = smf.logit(formula=f"TARGET ~ {predictors}", data=Xy2_train).fit()
# Calculate and print the running time in seconds
elapsed_time_LR2 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_LR2:.2f}")
# Display a summary of the logistic regression model results
print(LR2.summary())
Optimization terminated successfully.
Current function value: 0.362905
Iterations 7
Elapsed time (sec): 0.12
                           Logit Regression Results
==============================================================================
Dep. Variable:                 TARGET   No. Observations:                33425
Model:                          Logit   Df Residuals:                    33408
Method:                           MLE   Df Model:                           16
Date:                Tue, 13 May 2025   Pseudo R-squ.:                 0.06717
Time:                        11:19:43   Log-Likelihood:                -12130.
converged:                       True   LL-Null:                       -13004.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================================
                                      coef    std err          z      P>|z|      [0.025     0.975]
--------------------------------------------------------------------------------------------------
Intercept                          -0.4283      0.114     -3.768      0.000      -0.651     -0.206
number_inpatient                    0.2863      0.014     21.147      0.000       0.260      0.313
num_lab_procedures                  0.0295      0.019      1.565      0.118      -0.007      0.067
number_diagnoses                    0.1246      0.019      6.530      0.000       0.087      0.162
num_medications                     0.0424      0.018      2.333      0.020       0.007      0.078
discharge_disposition_id_emb_0      0.8267      0.355      2.327      0.020       0.130      1.523
discharge_disposition_id_emb_1      0.6854      0.326      2.100      0.036       0.046      1.325
medical_specialty_emb_0             0.4651      0.484      0.960      0.337      -0.484      1.415
medical_specialty_emb_1             2.8357      0.495      5.725      0.000       1.865      3.806
payer_code_emb_0                   -0.1563      0.476     -0.329      0.742      -1.088      0.776
payer_code_emb_1                   -1.1815      0.412     -2.868      0.004      -1.989     -0.374
admission_type_id_emb_0            -0.3465      0.317     -1.095      0.274      -0.967      0.274
admission_type_id_emb_1            -0.5162      0.320     -1.613      0.107      -1.143      0.111
age_emb_0                           0.3768      0.295      1.278      0.201      -0.201      0.955
age_emb_1                           0.5991      0.307      1.949      0.051      -0.003      1.201
hdiag_1_emb_0                      -1.3968      0.438     -3.188      0.001      -2.256     -0.538
hdiag_1_emb_1                      -1.0984      0.346     -3.171      0.002      -1.777     -0.420
==================================================================================================
# Evaluate the performance of the logistic regression model on the validation data
# and calculate the Area Under the Curve (AUC) for the ROC
auc_LR2 = roc_auc_score(y2_val, LR2.predict(X2_val))
# Print the validation AUC
print(f'The validation AUC of the logistic regression model is: {auc_LR2:.6f}')
The validation AUC of the logistic regression model is: 0.677810
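To recall what the AUC measures: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The following pure-Python sketch computes it by counting pairs; the function name `pairwise_auc` and the toy data are made up for illustration, and the quadratic loop is only meant for small examples, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (made-up labels and predicted probabilities)
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, p_toy)
print(auc_toy)  # 0.75
```

In the toy example three of the four positive-negative pairs are ordered correctly, hence 0.75. A validation AUC of about 0.68 therefore means that roughly two out of three such pairs are ranked correctly.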
Conclusion: Compared with the logistic regression model from Section 2.2, the embedding-based model shows the following:
- A higher pseudo R² (0.06717 vs. 0.05995), indicating a better fit on the training data;
- A higher log-likelihood (−12,130 vs. −12,224), reflecting lower training loss;
- Faster runtime and fewer features, pointing to greater efficiency; but
- A slightly lower validation AUC, which may indicate mild overfitting or random variation.
Additionally, when comparing shared numerical predictors, we note that while num_medications is not statistically significant at the 5% level in the Section 2.2 model, it is num_lab_procedures that lacks significance in the embedding-based model.
In summary, the embedding-based model fits the training data better and trains faster, but this does not translate into a higher validation AUC. Its primary drawback lies in interpretability; for example, we cannot directly determine which discharge disposition or age bracket is captured by emb_0 or emb_1. Additionally, the stability of the learned embedding weights poses another concern. Addressing these issues would significantly enhance the model's practical applicability.
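The fit statistics quoted above are linked by simple formulas, which the following sketch verifies with the values printed in the regression summary. It assumes that the reported pseudo R² is McFadden's measure, 1 - LL/LL_null, and that the "current function value" is the average negative log-likelihood per observation. The log-likelihoods are printed in rounded form, so the results agree only up to rounding.

```python
# Values taken from the regression summary (log-likelihoods are printed rounded)
log_likelihood = -12130.0
log_likelihood_null = -13004.0
n_obs = 33425

# McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)
pseudo_r2 = 1.0 - log_likelihood / log_likelihood_null

# Average negative log-likelihood per observation
mean_log_loss = -log_likelihood / n_obs

print(round(pseudo_r2, 4))      # 0.0672
print(round(mean_log_loss, 4))  # 0.3629
```

Both statistics are computed on the training data only, which is why they can improve while the validation AUC does not.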
PT6 c)
We proceed by applying the CatBoost model to our prepared dataset with the embedding features, again utilizing the default settings for a straightforward evaluation without hyperparameter tuning.
# Start timer to calculate the running time of training the CatBoost model
start_time = time.time()
# Fit the CatBoostClassifier on the training data
CB7 = CatBoostClassifier(eval_metric='AUC', random_seed=RANDOM_SEED)
CB7.fit(X2_train, y2_train, logging_level='Silent')
# Calculate and print the running time in seconds
elapsed_time_CB7 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_CB7:.2f}")
Elapsed time (sec): 15.83
# Calculate the AUC score using the model's prediction probabilities for the positive class
auc_CB7 = roc_auc_score(y2_val, CB7.predict_proba(X2_val)[:, 1])
# Print the validation AUC
print(f'The validation AUC of the CatBoost model with standard hyperparameters is: {auc_CB7:.6f}')
The validation AUC of the CatBoost model with standard hyperparameters is: 0.673485
While the training time of the embedding-based CatBoost model is significantly shorter than that of both the baseline model from Section 1.5 and the model using encoded categorical features from Section 6.2, its predictive performance is noticeably lower. One possible explanation is the exclusion of certain features during the embedding creation process that were previously included. To ensure a fair comparison with the baseline model, we therefore include these omitted features as well.
# Loop through features and extend dataset with its embedding vectors
for feature in categorical_features:
    # Get embedding weights and map each category to its embedding
    emb_weights = get_embedding_weights(ANN, feature)
    embeddings = {label: emb_weights[label] for label in range(emb_weights.shape[0])}
    # Replace the categorical column with its embedding columns
    embedding_df = df_pre2[feature].map(embeddings)
    embedding_df = pd.DataFrame(embedding_df.tolist(),
                                index=df_pre2.index,
                                columns=[f"{feature}_emb_{k}" for k in range(emb_weights.shape[1])])
    # Join embeddings to the original dataset
    df_pre2 = pd.concat([df_pre2, embedding_df], axis=1)
    # Drop the original categorical columns if now embedded
    df_pre2.drop(columns=[feature], inplace=True)
# Split the dataset into training, validation and test data
# Constants for data split ratios
TRAIN_RATIO = 0.7
VAL_TEST_RATIO = 0.5 # Splitting the remaining 30% equally into validation and test
# Separate the features (X) and the target (y)
X2_all = df_pre2.drop(columns=['TARGET'], axis=1)
y2 = df_pre2['TARGET']
# Split the dataset into a training set and a combined validation and test set
X2_all_train, X2_all_val_test, y2_train, y2_val_test = train_test_split(X2_all, y2, train_size=TRAIN_RATIO, random_state=RANDOM_SEED)
# Further split the combined validation and test set into separate validation and test sets
X2_all_val, X2_all_test, y2_val, y2_test = train_test_split(X2_all_val_test, y2_val_test, test_size=VAL_TEST_RATIO, random_state=RANDOM_SEED)
# Verify the dimensions of the training, validation, and test sets by displaying (rows, columns)
print(f"Training set dimensions (rows, columns): {X2_all_train.shape}")
print(f"Validation set dimensions (rows, columns): {X2_all_val.shape}")
print(f"Test set dimensions (rows, columns): {X2_all_test.shape}")
Training set dimensions (rows, columns): (33425, 70)
Validation set dimensions (rows, columns): (7163, 70)
Test set dimensions (rows, columns): (7163, 70)
# Start timer to calculate the running time of training the CatBoost model
start_time = time.time()
# Identify categorical features by data type 'object'
cat_features = list(df_pre2.select_dtypes(include=['object']).columns)
# Fit the CatBoostClassifier on the training data
CB8 = CatBoostClassifier(eval_metric='AUC', cat_features=cat_features, random_seed=RANDOM_SEED)
CB8.fit(X2_all_train, y2_train, logging_level='Silent')
# Calculate and print the running time in seconds
elapsed_time_CB8 = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_CB8:.2f}")
Elapsed time (sec): 170.29
# Calculate the AUC score using the model's prediction probabilities for the positive class
auc_CB8 = roc_auc_score(y2_val, CB8.predict_proba(X2_all_val)[:, 1])
# Print the validation AUC
print(f'The validation AUC of the CatBoost model with standard hyperparameters is: {auc_CB8:.6f}')
The validation AUC of the CatBoost model with standard hyperparameters is: 0.701283
As the result above shows, reintroducing the previously omitted features, namely those not considered in Tasks PT4 and PT5, raises the validation AUC from 0.673485 to 0.701283. However, the model still does not outperform the baseline model from Section 1.5, and it trains much longer than CB7 (170.29 vs. 15.83 seconds). The embeddings therefore do not improve CatBoost's predictive performance here, and their limited interpretability and stability remain obstacles to the model’s practical utility.
PT6 d)
# Evaluate the impact of the embeddings in the models considered
# Store the names, AUCs and training times of the previously fitted models
mname.extend(["CB6_tuned", "LGB_tuned", "XGB_tuned",
"ANN_tuned", "LR2_with_embeddings",
"CB7_with_embeddings", "CB8_quick_with_embeddings"])
mauc.extend([auc_CB6, auc_LGB, auc_XGB,
auc_ANN, auc_LR2, auc_CB7, auc_CB8])
mtrainingtime.extend([elapsed_time_CB6, elapsed_time_LGB, elapsed_time_XGB,
elapsed_time_ANN, elapsed_time_LR2,
elapsed_time_CB7, elapsed_time_CB8])
plot_model_performance(dict, 0.66, 0.72, "Validation")
Evaluation of Embeddings in the Models:
Performance (AUC): Embedding-based models (LR2_with_embeddings, CB8_quick_with_embeddings) exhibit slightly lower validation AUC scores compared to their non-embedding counterparts (LR1_select, CB1_quick, CB4_OHE, etc.). The best-performing model remains CB1_quick, the CatBoost variant with minimal feature preprocessing, although it also requires much longer training time.
Training time: Excluding CB8_quick_with_embeddings, where a notable portion of runtime is spent on categorical feature encoding, embedding-based models demonstrate faster training times. Both CB7_with_embeddings and LR2_with_embeddings are among the quickest to train. In contrast, CB1_quick and XGB_tuned demand substantially longer runtimes, a critical consideration for large-scale or time-sensitive applications.
Usefulness of Embeddings:
Pros:
- Enable better scalability for high-cardinality categorical features compared to one-hot encoding.
- Significantly reduce training time, which is beneficial for rapid prototyping or resource-constrained environments.
Cons:
- Slightly reduced predictive performance (lower AUC), potentially due to reduced granularity or challenges in training optimal embeddings;
- Lack of interpretability: Embeddings are difficult to interpret, posing challenges in regulated or high-stakes domains (e.g. healthcare in our case).
Recommendation:
If minimizing training time or improving scalability is the primary objective, embedding-based models offer a compelling solution with only marginal performance trade-offs. However, when predictive accuracy and interpretability are paramount, models using one-hot encoding, augmented with thoughtful feature engineering and tuning, remain the preferred choice.
In summary, embeddings provide valuable benefits in terms of efficiency but should be applied judiciously, particularly when performance or interpretability is a critical concern.
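For a compact view of the trade-off, the validation AUCs and training times reported in Tasks PT5 and PT6 can be tabulated as follows. This is only a sketch that restates numbers printed earlier in this notebook; models from other sections are omitted because their values are not shown here.

```python
# (model name, validation AUC, training time in seconds) as reported above
models = [
    ("ANN_tuned", 0.684344, 20.94),
    ("LR2_with_embeddings", 0.677810, 0.12),
    ("CB7_with_embeddings", 0.673485, 15.83),
    ("CB8_quick_with_embeddings", 0.701283, 170.29),
]

# Rank by validation AUC (descending) and find the fastest model
ranked = sorted(models, key=lambda m: m[1], reverse=True)
fastest = min(models, key=lambda m: m[2])

print(f"{'Model':<28}{'Val. AUC':>10}{'Time (s)':>10}")
for name, auc, seconds in ranked:
    print(f"{name:<28}{auc:>10.6f}{seconds:>10.2f}")
print("Fastest:", fastest[0])
```

Among these four models, the highest AUC comes with by far the longest training time, while the logistic regression on embeddings is the fastest to train.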
Task PT7: AutoML, Model Evaluation, and Application [Learning objectives: 5.1 — Points: 16]¶
a) Using AutoML or any other model or model ensemble not used so far (e.g., Stacking, Blending), try to improve the AUC score. An example of AutoML is AutoGluon, see https://auto.gluon.ai/stable/index.html. As before, work with training, validation, and test sets, and evaluate the model performance.
b) At the end of the notebook, Section 11 should be fully adjusted, and everything following it (Section 12, Appendix) should be deleted. In 11.1, evaluate all models created or optimized in the previous sections (excluding individual Search-CV models). In 11.2, at least six models should be selected for the Lift Chart, and an appropriate model should be chosen for the subsequent probability and percentile analyses.
Solution:
To improve the AUC score using an approach not previously explored, we employ AutoGluon, a powerful AutoML library, to train and optimize models, potentially leveraging ensemble techniques such as stacking. As input, we use the already encoded, scaled, and subsampled dataset, excluding embeddings for reasons outlined in the previous task.
# Install AutoGluon and import required module
!pip install autogluon
from autogluon.tabular import TabularPredictor
autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached pytorch_lightning-2.5.1.post0-py3-none-any.whl.metadata (20 kB) Collecting gluonts<0.17,>=0.15.0 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached gluonts-0.16.1-py3-none-any.whl.metadata (9.8 kB) Collecting statsforecast<2.0.2,>=1.7.0 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached statsforecast-2.0.1-cp310-cp310-win_amd64.whl.metadata (30 kB) Collecting mlforecast<0.14,>0.13 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached mlforecast-0.13.6-py3-none-any.whl.metadata (12 kB) Collecting utilsforecast<0.2.11,>=0.2.3 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached utilsforecast-0.2.10-py3-none-any.whl.metadata (7.4 kB) Collecting coreforecast<0.0.16,>=0.0.12 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached coreforecast-0.0.15-cp310-cp310-win_amd64.whl.metadata (3.8 kB) Collecting fugue>=0.9.0 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached fugue-0.9.1-py3-none-any.whl.metadata (18 kB) Collecting orjson~=3.9 (from autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached orjson-3.10.18-cp310-cp310-win_amd64.whl.metadata (43 kB) Requirement already satisfied: psutil<7.1.0,>=5.7.3 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from autogluon.common==1.3.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (6.1.0) Requirement already satisfied: packaging>=20.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from accelerate<2.0,>=0.34.0->autogluon.multimodal==1.3.0->autogluon) (24.2) Collecting pyyaml (from accelerate<2.0,>=0.34.0->autogluon.multimodal==1.3.0->autogluon) Using cached PyYAML-6.0.2-cp310-cp310-win_amd64.whl.metadata (2.1 kB) Collecting 
safetensors>=0.4.3 (from accelerate<2.0,>=0.34.0->autogluon.multimodal==1.3.0->autogluon) Using cached safetensors-0.5.3-cp38-abi3-win_amd64.whl.metadata (3.9 kB) Collecting botocore<1.39.0,>=1.38.13 (from boto3<2,>=1.10->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) Using cached botocore-1.38.13-py3-none-any.whl.metadata (5.7 kB) Collecting jmespath<2.0.0,>=0.7.1 (from boto3<2,>=1.10->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) Using cached jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB) Collecting s3transfer<0.13.0,>=0.12.0 (from boto3<2,>=1.10->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) Using cached s3transfer-0.12.0-py3-none-any.whl.metadata (1.7 kB) Requirement already satisfied: graphviz in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from catboost<1.3,>=1.2->autogluon.tabular[all]==1.3.0->autogluon) (0.20.3) Requirement already satisfied: plotly in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from catboost<1.3,>=1.2->autogluon.tabular[all]==1.3.0->autogluon) (5.24.1) Requirement already satisfied: six in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from catboost<1.3,>=1.2->autogluon.tabular[all]==1.3.0->autogluon) (1.17.0) Collecting datasets>=2.0.0 (from evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached datasets-3.6.0-py3-none-any.whl.metadata (19 kB) Collecting dill (from evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached dill-0.4.0-py3-none-any.whl.metadata (10 kB) Collecting xxhash (from evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached xxhash-3.5.0-cp310-cp310-win_amd64.whl.metadata (13 kB) Collecting multiprocess (from evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached multiprocess-0.70.18-py310-none-any.whl.metadata (7.5 kB) Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached 
fsspec-2025.3.2-py3-none-any.whl.metadata (11 kB) Requirement already satisfied: pip in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) (24.3.1) Collecting fastdownload<2,>=0.0.5 (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached fastdownload-0.0.7-py3-none-any.whl.metadata (5.5 kB) Collecting fastcore<1.9,>=1.8.0 (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached fastcore-1.8.2-py3-none-any.whl.metadata (3.7 kB) Collecting fasttransform>=0.0.2 (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached fasttransform-0.0.2-py3-none-any.whl.metadata (7.6 kB) Collecting fastprogress>=0.2.4 (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached fastprogress-1.0.3-py3-none-any.whl.metadata (5.6 kB) Collecting plum-dispatch (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached plum_dispatch-2.5.7-py3-none-any.whl.metadata (7.5 kB) Collecting cloudpickle (from fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached cloudpickle-3.1.1-py3-none-any.whl.metadata (7.1 kB) Collecting triad>=0.9.7 (from fugue>=0.9.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached triad-0.9.8-py3-none-any.whl.metadata (6.3 kB) Collecting adagio>=0.2.4 (from fugue>=0.9.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached adagio-0.2.6-py3-none-any.whl.metadata (1.8 kB) Collecting pydantic<3,>=1.7 (from gluonts<0.17,>=0.15.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached pydantic-2.11.4-py3-none-any.whl.metadata (66 kB) Collecting toolz~=0.10 (from gluonts<0.17,>=0.15.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached toolz-0.12.1-py3-none-any.whl.metadata (5.1 kB) Requirement already satisfied: typing-extensions~=4.0 in 
c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from gluonts<0.17,>=0.15.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) (4.12.2) Collecting future (from hyperopt<0.2.8,>=0.2.7->autogluon.core[all]==1.3.0->autogluon) Using cached future-1.0.0-py3-none-any.whl.metadata (4.0 kB) Collecting py4j (from hyperopt<0.2.8,>=0.2.7->autogluon.core[all]==1.3.0->autogluon) Using cached py4j-0.10.9.9-py2.py3-none-any.whl.metadata (1.3 kB) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from jinja2<3.2,>=3.0.3->autogluon.multimodal==1.3.0->autogluon) (3.0.2) Collecting attrs>=22.2.0 (from jsonschema<4.24,>=4.18->autogluon.multimodal==1.3.0->autogluon) Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) Collecting jsonschema-specifications>=2023.03.6 (from jsonschema<4.24,>=4.18->autogluon.multimodal==1.3.0->autogluon) Using cached jsonschema_specifications-2025.4.1-py3-none-any.whl.metadata (2.9 kB) Collecting referencing>=0.28.4 (from jsonschema<4.24,>=4.18->autogluon.multimodal==1.3.0->autogluon) Using cached referencing-0.36.2-py3-none-any.whl.metadata (2.8 kB) Collecting rpds-py>=0.7.1 (from jsonschema<4.24,>=4.18->autogluon.multimodal==1.3.0->autogluon) Using cached rpds_py-0.24.0-cp310-cp310-win_amd64.whl.metadata (4.2 kB) Collecting lightning-utilities<2.0,>=0.10.0 (from lightning<2.7,>=2.2->autogluon.multimodal==1.3.0->autogluon) Using cached lightning_utilities-0.14.3-py3-none-any.whl.metadata (5.6 kB) Requirement already satisfied: contourpy>=1.0.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from matplotlib<3.11,>=3.7.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (1.3.1) Requirement already satisfied: cycler>=0.10 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from matplotlib<3.11,>=3.7.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in 
c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from matplotlib<3.11,>=3.7.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (4.55.3) Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from matplotlib<3.11,>=3.7.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (1.4.7) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from matplotlib<3.11,>=3.7.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (3.2.0) Requirement already satisfied: python-dateutil>=2.7 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from matplotlib<3.11,>=3.7.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (2.9.0.post0) Collecting numba (from mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached numba-0.61.2-cp310-cp310-win_amd64.whl.metadata (2.9 kB) Collecting optuna (from mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached optuna-4.3.0-py3-none-any.whl.metadata (17 kB) Collecting window-ops (from mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached window_ops-0.0.15-py3-none-any.whl.metadata (6.8 kB) Collecting gdown>=4.0.0 (from nlpaug<1.2.0,>=1.1.10->autogluon.multimodal==1.3.0->autogluon) Using cached gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB) Collecting click (from nltk<4.0,>=3.4.5->autogluon.multimodal==1.3.0->autogluon) Downloading click-8.2.0-py3-none-any.whl.metadata (2.5 kB) Collecting regex>=2021.8.3 (from nltk<4.0,>=3.4.5->autogluon.multimodal==1.3.0->autogluon) Using cached regex-2024.11.6-cp310-cp310-win_amd64.whl.metadata (41 kB) Collecting antlr4-python3-runtime==4.9.* (from omegaconf<2.4.0,>=2.1.1->autogluon.multimodal==1.3.0->autogluon) Using cached antlr4_python3_runtime-4.9.3-py3-none-any.whl Requirement already 
satisfied: colorama in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) (0.4.6) Collecting model-index (from openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached model_index-0.1.11-py3-none-any.whl.metadata (3.9 kB) Collecting opendatalab (from openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached opendatalab-0.0.10-py3-none-any.whl.metadata (6.4 kB) Requirement already satisfied: rich in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) (13.9.4) Collecting tabulate (from openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) Requirement already satisfied: pytz>=2020.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from pandas<2.3.0,>=2.0.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (2024.2) Requirement already satisfied: tzdata>=2022.7 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from pandas<2.3.0,>=2.0.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (2024.2) Collecting filelock (from ray<2.45,>=2.10.0->ray[default]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB) Collecting msgpack<2.0.0,>=1.0.0 (from ray<2.45,>=2.10.0->ray[default]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached msgpack-1.1.0-cp310-cp310-win_amd64.whl.metadata (8.6 kB) Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from ray<2.45,>=2.10.0->ray[default]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) (5.29.2) Collecting aiosignal (from ray<2.45,>=2.10.0->ray[default]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached 
aiosignal-1.3.2-py2.py3-none-any.whl.metadata (3.8 kB) Collecting frozenlist (from ray<2.45,>=2.10.0->ray[default]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached frozenlist-1.6.0-cp310-cp310-win_amd64.whl.metadata (16 kB) Collecting aiohttp>=3.7 (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached aiohttp-3.11.18-cp310-cp310-win_amd64.whl.metadata (8.0 kB) Collecting aiohttp_cors (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached aiohttp_cors-0.8.1-py3-none-any.whl.metadata (20 kB) Collecting colorful (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached colorful-0.5.6-py2.py3-none-any.whl.metadata (16 kB) Collecting py-spy>=0.2.0 (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached py_spy-0.4.0-py2.py3-none-win_amd64.whl.metadata (16 kB) Requirement already satisfied: grpcio>=1.42.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) (1.68.1) Collecting opencensus (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached opencensus-0.11.4-py2.py3-none-any.whl.metadata (12 kB) Collecting prometheus_client>=0.7.1 (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached prometheus_client-0.21.1-py3-none-any.whl.metadata (1.8 kB) Collecting smart_open (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached smart_open-7.1.0-py3-none-any.whl.metadata (24 kB) Collecting virtualenv!=20.21.1,>=20.0.24 (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached virtualenv-20.31.2-py3-none-any.whl.metadata (4.5 kB) Collecting 
tensorboardX>=1.9 (from ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached tensorboardX-2.6.2.2-py2.py3-none-any.whl.metadata (5.8 kB) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from requests->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (3.4.0) Requirement already satisfied: idna<4,>=2.5 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from requests->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from requests->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (2.3.0) Requirement already satisfied: certifi>=2017.4.17 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from requests->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (2024.12.14) Collecting imageio!=2.35.0,>=2.33 (from scikit-image<0.26.0,>=0.19.1->autogluon.multimodal==1.3.0->autogluon) Using cached imageio-2.37.0-py3-none-any.whl.metadata (5.2 kB) Collecting tifffile>=2022.8.12 (from scikit-image<0.26.0,>=0.19.1->autogluon.multimodal==1.3.0->autogluon) Using cached tifffile-2025.5.10-py3-none-any.whl.metadata (31 kB) Collecting lazy-loader>=0.4 (from scikit-image<0.26.0,>=0.19.1->autogluon.multimodal==1.3.0->autogluon) Using cached lazy_loader-0.4-py3-none-any.whl.metadata (7.6 kB) Requirement already satisfied: threadpoolctl>=3.1.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from scikit-learn<1.7.0,>=1.4.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) (3.5.0) Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB) Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached 
spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB) Collecting murmurhash<1.1.0,>=0.28.0 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached murmurhash-1.0.12-cp310-cp310-win_amd64.whl.metadata (2.2 kB) Collecting cymem<2.1.0,>=2.0.2 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached cymem-2.0.11-cp310-cp310-win_amd64.whl.metadata (8.8 kB) Collecting preshed<3.1.0,>=3.0.2 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached preshed-3.0.9-cp310-cp310-win_amd64.whl.metadata (2.2 kB) Collecting thinc<8.4.0,>=8.3.4 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached thinc-8.3.6-cp310-cp310-win_amd64.whl.metadata (15 kB) Collecting wasabi<1.2.0,>=0.9.1 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached wasabi-1.1.3-py3-none-any.whl.metadata (28 kB) Collecting srsly<3.0.0,>=2.4.3 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached srsly-2.5.1-cp310-cp310-win_amd64.whl.metadata (20 kB) Collecting catalogue<2.1.0,>=2.0.6 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached catalogue-2.0.10-py3-none-any.whl.metadata (14 kB) Collecting weasel<0.5.0,>=0.1.0 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached weasel-0.4.1-py3-none-any.whl.metadata (4.6 kB) Collecting typer<1.0.0,>=0.3.0 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached typer-0.15.3-py3-none-any.whl.metadata (15 kB) Requirement already satisfied: setuptools in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) (75.6.0) Collecting langcodes<4.0.0,>=3.2.0 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached langcodes-3.5.0-py3-none-any.whl.metadata (29 kB) Requirement already satisfied: statsmodels>=0.13.2 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from 
statsforecast<2.0.2,>=1.7.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) (0.14.4) Requirement already satisfied: absl-py>=0.4 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from tensorboard<3,>=2.9->autogluon.multimodal==1.3.0->autogluon) (2.1.0) Requirement already satisfied: markdown>=2.6.8 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from tensorboard<3,>=2.9->autogluon.multimodal==1.3.0->autogluon) (3.7) Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from tensorboard<3,>=2.9->autogluon.multimodal==1.3.0->autogluon) (0.7.2) Requirement already satisfied: werkzeug>=1.0.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from tensorboard<3,>=2.9->autogluon.multimodal==1.3.0->autogluon) (3.1.3) Collecting sympy==1.13.1 (from torch<2.7,>=2.2->autogluon.multimodal==1.3.0->autogluon) Using cached sympy-1.13.1-py3-none-any.whl.metadata (12 kB) Collecting mpmath<1.4,>=1.1.0 (from sympy==1.13.1->torch<2.7,>=2.2->autogluon.multimodal==1.3.0->autogluon) Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB) Collecting tokenizers<0.22,>=0.21 (from transformers<4.50,>=4.38.0->transformers[sentencepiece]<4.50,>=4.38.0->autogluon.multimodal==1.3.0->autogluon) Using cached tokenizers-0.21.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB) Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece]<4.50,>=4.38.0->autogluon.multimodal==1.3.0->autogluon) Using cached sentencepiece-0.2.0-cp310-cp310-win_amd64.whl.metadata (8.3 kB) Collecting aiohappyeyeballs>=2.3.0 (from aiohttp>=3.7->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB) Collecting async-timeout<6.0,>=4.0 (from aiohttp>=3.7->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached 
async_timeout-5.0.1-py3-none-any.whl.metadata (5.1 kB) Collecting multidict<7.0,>=4.5 (from aiohttp>=3.7->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached multidict-6.4.3-cp310-cp310-win_amd64.whl.metadata (5.5 kB) Collecting propcache>=0.2.0 (from aiohttp>=3.7->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached propcache-0.3.1-cp310-cp310-win_amd64.whl.metadata (11 kB) Collecting yarl<2.0,>=1.17.0 (from aiohttp>=3.7->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached yarl-1.20.0-cp310-cp310-win_amd64.whl.metadata (74 kB) Collecting dill (from evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB) Collecting multiprocess (from evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB) Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate<0.5.0,>=0.4.0->autogluon.multimodal==1.3.0->autogluon) Using cached fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB) Collecting beautifulsoup4 (from gdown>=4.0.0->nlpaug<1.2.0,>=1.1.10->autogluon.multimodal==1.3.0->autogluon) Using cached beautifulsoup4-4.13.4-py3-none-any.whl.metadata (3.8 kB) Collecting language-data>=1.2 (from langcodes<4.0.0,>=3.2.0->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached language_data-1.3.0-py3-none-any.whl.metadata (4.3 kB) Collecting llvmlite<0.45,>=0.44.0dev0 (from numba->mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached llvmlite-0.44.0-cp310-cp310-win_amd64.whl.metadata (5.0 kB) Collecting annotated-types>=0.6.0 (from pydantic<3,>=1.7->gluonts<0.17,>=0.15.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB) Collecting 
pydantic-core==2.33.2 (from pydantic<3,>=1.7->gluonts<0.17,>=0.15.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached pydantic_core-2.33.2-cp310-cp310-win_amd64.whl.metadata (6.9 kB) Collecting typing-inspection>=0.4.0 (from pydantic<3,>=1.7->gluonts<0.17,>=0.15.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached typing_inspection-0.4.0-py3-none-any.whl.metadata (2.6 kB) Requirement already satisfied: patsy>=0.5.6 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from statsmodels>=0.13.2->statsforecast<2.0.2,>=1.7.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) (1.0.1) Collecting blis<1.4.0,>=1.3.0 (from thinc<8.4.0,>=8.3.4->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached blis-1.3.0-cp310-cp310-win_amd64.whl.metadata (7.6 kB) Collecting confection<1.0.0,>=0.0.1 (from thinc<8.4.0,>=8.3.4->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached confection-0.1.5-py3-none-any.whl.metadata (19 kB) INFO: pip is looking at multiple versions of thinc to determine which version is compatible with other requirements. This could take a while. 
Collecting thinc<8.4.0,>=8.3.4 (from spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached thinc-8.3.4-cp310-cp310-win_amd64.whl.metadata (15 kB) Collecting blis<1.3.0,>=1.2.0 (from thinc<8.4.0,>=8.3.4->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached blis-1.2.1-cp310-cp310-win_amd64.whl.metadata (7.6 kB) Collecting fs (from triad>=0.9.7->fugue>=0.9.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached fs-2.4.16-py2.py3-none-any.whl.metadata (6.3 kB) Collecting shellingham>=1.3.0 (from typer<1.0.0,>=0.3.0->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached shellingham-1.5.4-py2.py3-none-any.whl.metadata (3.5 kB) Requirement already satisfied: markdown-it-py>=2.2.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from rich->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) (3.0.0) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from rich->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) (2.18.0) Collecting distlib<1,>=0.3.7 (from virtualenv!=20.21.1,>=20.0.24->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB) Requirement already satisfied: platformdirs<5,>=3.9.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from virtualenv!=20.21.1,>=20.0.24->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) (4.3.6) Collecting cloudpathlib<1.0.0,>=0.7.0 (from weasel<0.5.0,>=0.1.0->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached cloudpathlib-0.21.0-py3-none-any.whl.metadata (14 kB) Requirement already satisfied: wrapt in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from smart_open->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) (1.17.0) Collecting ordered-set (from 
model-index->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached ordered_set-4.1.0-py3-none-any.whl.metadata (5.3 kB) Collecting opencensus-context>=0.1.3 (from opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached opencensus_context-0.1.3-py2.py3-none-any.whl.metadata (3.3 kB) Collecting google-api-core<3.0.0,>=1.0.0 (from opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached google_api_core-2.24.2-py3-none-any.whl.metadata (3.0 kB) Collecting pycryptodome (from opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached pycryptodome-3.22.0-cp37-abi3-win_amd64.whl.metadata (3.4 kB) Collecting openxlab (from opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached openxlab-0.1.2-py3-none-any.whl.metadata (3.8 kB) Requirement already satisfied: pywin32 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) (310) Collecting alembic>=1.5.0 (from optuna->mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached alembic-1.15.2-py3-none-any.whl.metadata (7.3 kB) Collecting colorlog (from optuna->mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached colorlog-6.9.0-py3-none-any.whl.metadata (10 kB) Collecting sqlalchemy>=1.4.2 (from optuna->mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached sqlalchemy-2.0.40-cp310-cp310-win_amd64.whl.metadata (9.9 kB) Requirement already satisfied: tenacity>=6.2.0 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from plotly->catboost<1.3,>=1.2->autogluon.tabular[all]==1.3.0->autogluon) (9.0.0) Collecting beartype>=0.16.2 (from 
plum-dispatch->fastai<2.9,>=2.3.1->autogluon.tabular[all]==1.3.0->autogluon) Using cached beartype-0.20.2-py3-none-any.whl.metadata (33 kB) Collecting Mako (from alembic>=1.5.0->optuna->mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached mako-1.3.10-py3-none-any.whl.metadata (2.9 kB) Collecting googleapis-common-protos<2.0.0,>=1.56.2 (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached googleapis_common_protos-1.70.0-py3-none-any.whl.metadata (9.3 kB) Collecting proto-plus<2.0.0,>=1.22.3 (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached proto_plus-1.26.1-py3-none-any.whl.metadata (2.2 kB) Collecting google-auth<3.0.0,>=2.14.1 (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached google_auth-2.40.1-py2.py3-none-any.whl.metadata (6.2 kB) Collecting marisa-trie>=1.1.0 (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy<3.9->autogluon.tabular[all]==1.3.0->autogluon) Using cached marisa_trie-1.2.1-cp310-cp310-win_amd64.whl.metadata (9.3 kB) Requirement already satisfied: mdurl~=0.1 in c:\users\chunn\anaconda3\envs\ads3\lib\site-packages (from markdown-it-py>=2.2.0->rich->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) (0.1.2) Collecting greenlet>=1 (from sqlalchemy>=1.4.2->optuna->mlforecast<0.14,>0.13->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached greenlet-3.2.2-cp310-cp310-win_amd64.whl.metadata (4.2 kB) Collecting soupsieve>1.2 (from beautifulsoup4->gdown>=4.0.0->nlpaug<1.2.0,>=1.1.10->autogluon.multimodal==1.3.0->autogluon) Using cached soupsieve-2.7-py3-none-any.whl.metadata (4.6 kB) Collecting appdirs~=1.4.3 (from 
fs->triad>=0.9.7->fugue>=0.9.0->autogluon.timeseries==1.3.0->autogluon.timeseries[all]==1.3.0->autogluon) Using cached appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB) Collecting filelock (from ray<2.45,>=2.10.0->ray[default]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB) Collecting oss2~=2.17.0 (from openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached oss2-2.17.0.tar.gz (259 kB) Preparing metadata (setup.py): started Preparing metadata (setup.py): finished with status 'done' Collecting pytz>=2020.1 (from pandas<2.3.0,>=2.0.0->autogluon.core==1.3.0->autogluon.core[all]==1.3.0->autogluon) Using cached pytz-2023.4-py2.py3-none-any.whl.metadata (22 kB) INFO: pip is looking at multiple versions of openxlab to determine which version is compatible with other requirements. This could take a while. Collecting openxlab (from opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal==1.3.0->autogluon) Using cached openxlab-0.1.1-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.1.0-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.38-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.37-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.36-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.35-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.34-py3-none-any.whl.metadata (3.8 kB) INFO: pip is still looking at multiple versions of openxlab to determine which version is compatible with other requirements. This could take a while. Using cached openxlab-0.0.33-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.32-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.31-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.30-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.29-py3-none-any.whl.metadata (3.8 kB) INFO: This is taking longer than usual. 
You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C. Using cached openxlab-0.0.28-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.27-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.26-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.25-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.24-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.23-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.22-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.21-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.20-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.19-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.18-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.17-py3-none-any.whl.metadata (3.7 kB) Using cached openxlab-0.0.16-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.15-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.14-py3-none-any.whl.metadata (3.8 kB) Using cached openxlab-0.0.13-py3-none-any.whl.metadata (4.5 kB) Using cached openxlab-0.0.12-py3-none-any.whl.metadata (4.5 kB) Using cached openxlab-0.0.11-py3-none-any.whl.metadata (4.3 kB) Collecting PySocks!=1.5.7,>=1.5.6 (from requests[socks]->gdown>=4.0.0->nlpaug<1.2.0,>=1.1.10->autogluon.multimodal==1.3.0->autogluon) Using cached PySocks-1.7.1-py3-none-any.whl.metadata (13 kB) Collecting cachetools<6.0,>=2.0.0 (from google-auth<3.0.0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached cachetools-5.5.2-py3-none-any.whl.metadata (5.4 kB) Collecting pyasn1-modules>=0.2.1 (from google-auth<3.0.0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == 
"all"->autogluon.core[all]==1.3.0->autogluon) Using cached pyasn1_modules-0.4.2-py3-none-any.whl.metadata (3.5 kB) Collecting rsa<5,>=3.1.4 (from google-auth<3.0.0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached rsa-4.9.1-py3-none-any.whl.metadata (5.6 kB) Collecting pyasn1<0.7.0,>=0.6.1 (from pyasn1-modules>=0.2.1->google-auth<3.0.0,>=2.14.1->google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.45,>=2.10.0; extra == "all"->autogluon.core[all]==1.3.0->autogluon) Using cached pyasn1-0.6.1-py3-none-any.whl.metadata (8.4 kB) Using cached autogluon-1.3.0-py3-none-any.whl (9.8 kB) Using cached autogluon.core-1.3.0-py3-none-any.whl (222 kB) Using cached autogluon.features-1.3.0-py3-none-any.whl (64 kB) Using cached autogluon.multimodal-1.3.0-py3-none-any.whl (454 kB) Using cached autogluon.tabular-1.3.0-py3-none-any.whl (382 kB) Using cached autogluon.timeseries-1.3.0-py3-none-any.whl (181 kB) Using cached autogluon.common-1.3.0-py3-none-any.whl (69 kB) Using cached accelerate-1.6.0-py3-none-any.whl (354 kB) Using cached boto3-1.38.13-py3-none-any.whl (139 kB) Using cached coreforecast-0.0.15-cp310-cp310-win_amd64.whl (189 kB) Using cached defusedxml-0.7.1-py2.py3-none-any.whl (25 kB) Using cached einops-0.8.1-py3-none-any.whl (64 kB) Using cached evaluate-0.4.3-py3-none-any.whl (84 kB) Using cached fastai-2.8.1-py3-none-any.whl (235 kB) Using cached fugue-0.9.1-py3-none-any.whl (278 kB) Using cached gluonts-0.16.1-py3-none-any.whl (1.5 MB) Using cached hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB) Using cached jsonschema-4.23.0-py3-none-any.whl (88 kB) Using cached lightning-2.5.1.post0-py3-none-any.whl (819 kB) Using cached mlforecast-0.13.6-py3-none-any.whl (71 kB) Using cached networkx-3.4.2-py3-none-any.whl (1.7 MB) Using cached nlpaug-1.1.11-py3-none-any.whl (410 kB) Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB) Using cached 
omegaconf-2.3.0-py3-none-any.whl (79 kB) Using cached openmim-0.3.9-py2.py3-none-any.whl (52 kB) Using cached orjson-3.10.18-cp310-cp310-win_amd64.whl (134 kB) Using cached pdf2image-1.17.0-py3-none-any.whl (11 kB) Using cached pyarrow-20.0.0-cp310-cp310-win_amd64.whl (25.8 MB) Using cached pytesseract-0.3.13-py3-none-any.whl (14 kB) Using cached pytorch_metric_learning-2.8.1-py3-none-any.whl (125 kB) Using cached ray-2.44.1-cp310-cp310-win_amd64.whl (25.7 MB) Using cached scikit_image-0.25.2-cp310-cp310-win_amd64.whl (12.8 MB) Using cached spacy-3.8.5-cp310-cp310-win_amd64.whl (12.2 MB) Using cached statsforecast-2.0.1-cp310-cp310-win_amd64.whl (275 kB) Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB) Using cached timm-1.0.3-py3-none-any.whl (2.3 MB) Using cached torch-2.6.0-cp310-cp310-win_amd64.whl (204.2 MB) Using cached sympy-1.13.1-py3-none-any.whl (6.2 MB) Using cached torchmetrics-1.7.1-py3-none-any.whl (961 kB) Using cached torchvision-0.21.0-cp310-cp310-win_amd64.whl (1.6 MB) Using cached tqdm-4.67.1-py3-none-any.whl (78 kB) Using cached transformers-4.49.0-py3-none-any.whl (10.0 MB) Using cached utilsforecast-0.2.10-py3-none-any.whl (41 kB) Using cached pytorch_lightning-2.5.1.post0-py3-none-any.whl (823 kB) Using cached adagio-0.2.6-py3-none-any.whl (19 kB) Using cached aiohttp-3.11.18-cp310-cp310-win_amd64.whl (442 kB) Using cached aiosignal-1.3.2-py2.py3-none-any.whl (7.6 kB) Using cached attrs-25.3.0-py3-none-any.whl (63 kB) Using cached botocore-1.38.13-py3-none-any.whl (13.5 MB) Using cached catalogue-2.0.10-py3-none-any.whl (17 kB) Downloading click-8.2.0-py3-none-any.whl (102 kB) Using cached cymem-2.0.11-cp310-cp310-win_amd64.whl (39 kB) Using cached datasets-3.6.0-py3-none-any.whl (491 kB) Using cached dill-0.3.8-py3-none-any.whl (116 kB) Using cached fastcore-1.8.2-py3-none-any.whl (78 kB) Using cached fastdownload-0.0.7-py3-none-any.whl (12 kB) Using cached fastprogress-1.0.3-py3-none-any.whl (12 kB) Using cached 
fasttransform-0.0.2-py3-none-any.whl (14 kB) Using cached frozenlist-1.6.0-cp310-cp310-win_amd64.whl (120 kB) Using cached fsspec-2025.3.0-py3-none-any.whl (193 kB) Using cached gdown-5.2.0-py3-none-any.whl (18 kB) Using cached huggingface_hub-0.31.1-py3-none-any.whl (484 kB) Using cached imageio-2.37.0-py3-none-any.whl (315 kB) Using cached jmespath-1.0.1-py3-none-any.whl (20 kB) Using cached jsonschema_specifications-2025.4.1-py3-none-any.whl (18 kB) Using cached langcodes-3.5.0-py3-none-any.whl (182 kB) Using cached lazy_loader-0.4-py3-none-any.whl (12 kB) Using cached lightning_utilities-0.14.3-py3-none-any.whl (28 kB) Using cached msgpack-1.1.0-cp310-cp310-win_amd64.whl (74 kB) Using cached multiprocess-0.70.16-py310-none-any.whl (134 kB) Using cached murmurhash-1.0.12-cp310-cp310-win_amd64.whl (25 kB) Using cached numba-0.61.2-cp310-cp310-win_amd64.whl (2.8 MB) Using cached preshed-3.0.9-cp310-cp310-win_amd64.whl (122 kB) Using cached prometheus_client-0.21.1-py3-none-any.whl (54 kB) Using cached py_spy-0.4.0-py2.py3-none-win_amd64.whl (1.8 MB) Using cached pydantic-2.11.4-py3-none-any.whl (443 kB) Using cached pydantic_core-2.33.2-cp310-cp310-win_amd64.whl (2.0 MB) Using cached PyYAML-6.0.2-cp310-cp310-win_amd64.whl (161 kB) Using cached referencing-0.36.2-py3-none-any.whl (26 kB) Using cached regex-2024.11.6-cp310-cp310-win_amd64.whl (274 kB) Using cached rpds_py-0.24.0-cp310-cp310-win_amd64.whl (234 kB) Using cached s3transfer-0.12.0-py3-none-any.whl (84 kB) Using cached safetensors-0.5.3-cp38-abi3-win_amd64.whl (308 kB) Using cached sentencepiece-0.2.0-cp310-cp310-win_amd64.whl (991 kB) Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB) Using cached spacy_loggers-1.0.5-py3-none-any.whl (22 kB) Using cached srsly-2.5.1-cp310-cp310-win_amd64.whl (632 kB) Using cached tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB) Using cached thinc-8.3.4-cp310-cp310-win_amd64.whl (1.5 MB) Using cached tifffile-2025.5.10-py3-none-any.whl (226 kB) Using cached 
tokenizers-0.21.1-cp39-abi3-win_amd64.whl (2.4 MB) Using cached toolz-0.12.1-py3-none-any.whl (56 kB) Using cached triad-0.9.8-py3-none-any.whl (62 kB) Using cached typer-0.15.3-py3-none-any.whl (45 kB) Using cached virtualenv-20.31.2-py3-none-any.whl (6.1 MB) Using cached filelock-3.18.0-py3-none-any.whl (16 kB) Using cached wasabi-1.1.3-py3-none-any.whl (27 kB) Using cached weasel-0.4.1-py3-none-any.whl (50 kB) Using cached smart_open-7.1.0-py3-none-any.whl (61 kB) Using cached aiohttp_cors-0.8.1-py3-none-any.whl (25 kB) Using cached cloudpickle-3.1.1-py3-none-any.whl (20 kB) Using cached colorful-0.5.6-py2.py3-none-any.whl (201 kB) Using cached future-1.0.0-py3-none-any.whl (491 kB) Using cached model_index-0.1.11-py3-none-any.whl (34 kB) Using cached opencensus-0.11.4-py2.py3-none-any.whl (128 kB) Using cached opendatalab-0.0.10-py3-none-any.whl (29 kB) Using cached optuna-4.3.0-py3-none-any.whl (386 kB) Using cached plum_dispatch-2.5.7-py3-none-any.whl (42 kB) Using cached py4j-0.10.9.9-py2.py3-none-any.whl (203 kB) Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) Using cached window_ops-0.0.15-py3-none-any.whl (15 kB) Using cached xxhash-3.5.0-cp310-cp310-win_amd64.whl (30 kB) Using cached aiohappyeyeballs-2.6.1-py3-none-any.whl (15 kB) Using cached alembic-1.15.2-py3-none-any.whl (231 kB) Using cached annotated_types-0.7.0-py3-none-any.whl (13 kB) Using cached async_timeout-5.0.1-py3-none-any.whl (6.2 kB) Using cached beartype-0.20.2-py3-none-any.whl (1.2 MB) Using cached blis-1.2.1-cp310-cp310-win_amd64.whl (6.2 MB) Using cached cloudpathlib-0.21.0-py3-none-any.whl (52 kB) Using cached confection-0.1.5-py3-none-any.whl (35 kB) Using cached distlib-0.3.9-py2.py3-none-any.whl (468 kB) Using cached google_api_core-2.24.2-py3-none-any.whl (160 kB) Using cached language_data-1.3.0-py3-none-any.whl (5.4 MB) Using cached llvmlite-0.44.0-cp310-cp310-win_amd64.whl (30.3 MB) Using cached mpmath-1.3.0-py3-none-any.whl (536 kB) Using cached 
multidict-6.4.3-cp310-cp310-win_amd64.whl (38 kB) Using cached opencensus_context-0.1.3-py2.py3-none-any.whl (5.1 kB) Using cached propcache-0.3.1-cp310-cp310-win_amd64.whl (45 kB) Using cached shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB) Using cached sqlalchemy-2.0.40-cp310-cp310-win_amd64.whl (2.1 MB) Using cached typing_inspection-0.4.0-py3-none-any.whl (14 kB) Using cached yarl-1.20.0-cp310-cp310-win_amd64.whl (92 kB) Using cached beautifulsoup4-4.13.4-py3-none-any.whl (187 kB) Using cached colorlog-6.9.0-py3-none-any.whl (11 kB) Using cached fs-2.4.16-py2.py3-none-any.whl (135 kB) Using cached openxlab-0.0.11-py3-none-any.whl (55 kB) Using cached ordered_set-4.1.0-py3-none-any.whl (7.6 kB) Using cached pycryptodome-3.22.0-cp37-abi3-win_amd64.whl (1.8 MB) Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB) Using cached google_auth-2.40.1-py2.py3-none-any.whl (216 kB) Using cached googleapis_common_protos-1.70.0-py3-none-any.whl (294 kB) Using cached greenlet-3.2.2-cp310-cp310-win_amd64.whl (294 kB) Using cached marisa_trie-1.2.1-cp310-cp310-win_amd64.whl (151 kB) Using cached proto_plus-1.26.1-py3-none-any.whl (50 kB) Using cached PySocks-1.7.1-py3-none-any.whl (16 kB) Using cached soupsieve-2.7-py3-none-any.whl (36 kB) Using cached mako-1.3.10-py3-none-any.whl (78 kB) Using cached cachetools-5.5.2-py3-none-any.whl (10 kB) Using cached pyasn1_modules-0.4.2-py3-none-any.whl (181 kB) Using cached rsa-4.9.1-py3-none-any.whl (34 kB) Using cached pyasn1-0.6.1-py3-none-any.whl (83 kB) Installing collected packages: text-unidecode, sentencepiece, py4j, py-spy, opencensus-context, nvidia-ml-py3, mpmath, distlib, cymem, appdirs, antlr4-python3-runtime, xxhash, wasabi, typing-inspection, tqdm, toolz, tifffile, tensorboardX, tabulate, sympy, spacy-loggers, spacy-legacy, soupsieve, smart_open, shellingham, safetensors, rpds-py, regex, pyyaml, pytesseract, PySocks, pydantic-core, pycryptodome, pyasn1, pyarrow, proto-plus, propcache, prometheus_client, pdf2image, 
orjson, ordered-set, openxlab, networkx, murmurhash, multidict, msgpack, marisa-trie, Mako, llvmlite, lightning-utilities, lazy-loader, jmespath, imageio, greenlet, googleapis-common-protos, future, fsspec, fs, frozenlist, filelock, fastprogress, fastcore, einops, dill, defusedxml, coreforecast, colorlog, colorful, cloudpickle, cloudpathlib, click, catalogue, cachetools, blis, beartype, attrs, async-timeout, annotated-types, aiohappyeyeballs, yarl, virtualenv, torch, srsly, sqlalchemy, scikit-image, rsa, referencing, pydantic, pyasn1-modules, preshed, omegaconf, numba, nltk, multiprocess, model-index, language-data, hyperopt, huggingface-hub, fastdownload, botocore, beautifulsoup4, aiosignal, window-ops, utilsforecast, typer, triad, torchvision, torchmetrics, tokenizers, seqeval, s3transfer, pytorch-metric-learning, plum-dispatch, opendatalab, langcodes, jsonschema-specifications, google-auth, gluonts, gdown, confection, alembic, aiohttp, accelerate, weasel, transformers, timm, thinc, optuna, openmim, nlpaug, jsonschema, google-api-core, fasttransform, boto3, aiohttp_cors, adagio, spacy, ray, pytorch-lightning, opencensus, mlforecast, fugue, datasets, autogluon.common, statsforecast, lightning, fastai, evaluate, autogluon.features, autogluon.core, autogluon.tabular, autogluon.multimodal, autogluon.timeseries, autogluon Successfully installed Mako-1.3.10 PySocks-1.7.1 accelerate-1.6.0 adagio-0.2.6 aiohappyeyeballs-2.6.1 aiohttp-3.11.18 aiohttp_cors-0.8.1 aiosignal-1.3.2 alembic-1.15.2 annotated-types-0.7.0 antlr4-python3-runtime-4.9.3 appdirs-1.4.4 async-timeout-5.0.1 attrs-25.3.0 autogluon-1.3.0 autogluon.common-1.3.0 autogluon.core-1.3.0 autogluon.features-1.3.0 autogluon.multimodal-1.3.0 autogluon.tabular-1.3.0 autogluon.timeseries-1.3.0 beartype-0.20.2 beautifulsoup4-4.13.4 blis-1.2.1 boto3-1.38.13 botocore-1.38.13 cachetools-5.5.2 catalogue-2.0.10 click-8.2.0 cloudpathlib-0.21.0 cloudpickle-3.1.1 colorful-0.5.6 colorlog-6.9.0 confection-0.1.5 
coreforecast-0.0.15 cymem-2.0.11 datasets-3.6.0 defusedxml-0.7.1 dill-0.3.8 distlib-0.3.9 einops-0.8.1 evaluate-0.4.3 fastai-2.8.1 fastcore-1.8.2 fastdownload-0.0.7 fastprogress-1.0.3 fasttransform-0.0.2 filelock-3.18.0 frozenlist-1.6.0 fs-2.4.16 fsspec-2025.3.0 fugue-0.9.1 future-1.0.0 gdown-5.2.0 gluonts-0.16.1 google-api-core-2.24.2 google-auth-2.40.1 googleapis-common-protos-1.70.0 greenlet-3.2.2 huggingface-hub-0.31.1 hyperopt-0.2.7 imageio-2.37.0 jmespath-1.0.1 jsonschema-4.23.0 jsonschema-specifications-2025.4.1 langcodes-3.5.0 language-data-1.3.0 lazy-loader-0.4 lightning-2.5.1.post0 lightning-utilities-0.14.3 llvmlite-0.44.0 marisa-trie-1.2.1 mlforecast-0.13.6 model-index-0.1.11 mpmath-1.3.0 msgpack-1.1.0 multidict-6.4.3 multiprocess-0.70.16 murmurhash-1.0.12 networkx-3.4.2 nlpaug-1.1.11 nltk-3.9.1 numba-0.61.2 nvidia-ml-py3-7.352.0 omegaconf-2.3.0 opencensus-0.11.4 opencensus-context-0.1.3 opendatalab-0.0.10 openmim-0.3.9 openxlab-0.0.11 optuna-4.3.0 ordered-set-4.1.0 orjson-3.10.18 pdf2image-1.17.0 plum-dispatch-2.5.7 preshed-3.0.9 prometheus_client-0.21.1 propcache-0.3.1 proto-plus-1.26.1 py-spy-0.4.0 py4j-0.10.9.9 pyarrow-20.0.0 pyasn1-0.6.1 pyasn1-modules-0.4.2 pycryptodome-3.22.0 pydantic-2.11.4 pydantic-core-2.33.2 pytesseract-0.3.13 pytorch-lightning-2.5.1.post0 pytorch-metric-learning-2.8.1 pyyaml-6.0.2 ray-2.44.1 referencing-0.36.2 regex-2024.11.6 rpds-py-0.24.0 rsa-4.9.1 s3transfer-0.12.0 safetensors-0.5.3 scikit-image-0.25.2 sentencepiece-0.2.0 seqeval-1.2.2 shellingham-1.5.4 smart_open-7.1.0 soupsieve-2.7 spacy-3.8.5 spacy-legacy-3.0.12 spacy-loggers-1.0.5 sqlalchemy-2.0.40 srsly-2.5.1 statsforecast-2.0.1 sympy-1.13.1 tabulate-0.9.0 tensorboardX-2.6.2.2 text-unidecode-1.3 thinc-8.3.4 tifffile-2025.5.10 timm-1.0.3 tokenizers-0.21.1 toolz-0.12.1 torch-2.6.0 torchmetrics-1.7.1 torchvision-0.21.0 tqdm-4.67.1 transformers-4.49.0 triad-0.9.8 typer-0.15.3 typing-inspection-0.4.0 utilsforecast-0.2.10 virtualenv-20.31.2 wasabi-1.1.3 weasel-0.4.1 
window-ops-0.0.15 xxhash-3.5.0 yarl-1.20.0
[notice] A new release of pip is available: 24.3.1 -> 25.1.1 [notice] To update, run: python.exe -m pip install --upgrade pip c:\Users\chunn\anaconda3\envs\ads3\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
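The next cell attaches the target to the feature frames. When a Series is assigned to a DataFrame column, pandas aligns on index labels rather than on row position, so a target that still carries its original row labels would be filled with NaN if the feature frame has a fresh 0..n-1 index. The sketch below shows this effect on toy data. The names `X_toy` and `y_toy` are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: features with a fresh 0..n-1 index (as after scaling
# and rebuilding the DataFrame) and a target that kept its original row labels.
X_toy = pd.DataFrame({"x": [0.1, 0.2, 0.3]})        # index 0, 1, 2
y_toy = pd.Series([1, 0, 1], index=[10, 11, 12])    # index 10, 11, 12

# Direct assignment aligns on labels; none match, so every target becomes NaN.
misaligned = X_toy.copy()
misaligned["TARGET"] = y_toy
n_missing = int(misaligned["TARGET"].isna().sum())

# Resetting the target index restores the positional correspondence.
aligned = X_toy.copy()
aligned["TARGET"] = y_toy.reset_index(drop=True)

print(n_missing)                   # number of targets lost to misalignment
print(aligned["TARGET"].tolist())  # targets after resetting the index
```

A quick `isna().sum()` on the `TARGET` column after the assignment is a cheap way to confirm that features and labels are still matched row by row.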
# Combine features and target variable into a single dataframe as required by AutoGluon
Xy_train = X_train.copy()
Xy_val = X_val.copy()
Xy_train['TARGET'] = y_train
Xy_val['TARGET'] = y_val.reset_index(drop=True) # Since StandardScaler above has reset the index
# Start timer to measure the running time of the AutoGluon training
start_time = time.time()
# Ensure reproducibility
set_random_seed(RANDOM_SEED)
# Initialize and fit the AutoGluon predictor
AG = TabularPredictor(
    label='TARGET',
    eval_metric='roc_auc',
    path='autogluon_diabetes_auc',
).fit(
    train_data=Xy_train,
    time_limit=3600,         # One hour to search for the best model
    presets='best_quality',  # Enables bagging and multi-layer stacking for best possible performance
    auto_stack=True,         # Explicit stacking (already implied by 'best_quality')
    num_stack_levels=2,      # Number of stacking (meta-model) layers
)
elapsed_time_AG = time.time() - start_time
print(f"Elapsed time (sec): {elapsed_time_AG:.2f}")
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version: 1.3.0
Python Version: 3.10.16
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.26100
CPU Count: 12
Memory Avail: 2.35 GB / 15.73 GB (14.9%)
Disk Space Avail: 512.29 GB / 936.72 GB (54.7%)
===================================================
Presets specified: ['best_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=2, num_bag_folds=8, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
This is used to identify the optimal `num_stack_levels` value. Copies of AutoGluon will be fit on subsets of the data. Then holdout validation data is used to detect stacked overfitting.
Running DyStack for up to 900s of the 3600s of remaining time (25%).
2025-05-13 11:24:07,066 INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
Running DyStack sub-fit in a ray process to avoid memory leakage. Enabling ray logging (enable_ray_logging=True). Specify `ds_args={'enable_ray_logging': False}` if you experience logging issues.
2025-05-13 11:24:12,436 INFO worker.py:1843 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266
Context path: "c:\studium\DAV Ausbildung\CADS\Immersion\ads3_exam\03_scripts\autogluon_diabetes_auc\ds_sub_fit\sub_fit_ho"
(_dystack pid=32908) Running DyStack sub-fit ...
(_dystack pid=32908) Beginning AutoGluon training ... Time limit = 892s
(_dystack pid=32908) AutoGluon will save models to "c:\studium\DAV Ausbildung\CADS\Immersion\ads3_exam\03_scripts\autogluon_diabetes_auc\ds_sub_fit\sub_fit_ho"
(_dystack pid=32908) Train Data Rows: 11712
(_dystack pid=32908) Train Data Columns: 2248
(_dystack pid=32908) Label Column: TARGET
(_dystack pid=32908) Problem Type: binary
(_dystack pid=32908) Preprocessing data ...
(_dystack pid=32908) Selected class <--> label mapping: class 1 = 1, class 0 = 0
(_dystack pid=32908) Using Feature Generators to preprocess the data ...
(_dystack pid=32908) Fitting AutoMLPipelineFeatureGenerator...
(_dystack pid=32908) Available Memory: 2322.95 MB
(_dystack pid=32908) Train Data (Original) Memory Usage: 200.87 MB (8.6% of available memory)
(_dystack pid=32908) Warning: Data size prior to feature transformation consumes 8.6% of available memory. Consider increasing memory or subsampling the data to avoid instability.
(_dystack pid=32908) Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
(_dystack pid=32908) Stage 1 Generators:
(_dystack pid=32908) Fitting AsTypeFeatureGenerator...
(_dystack pid=32908) Note: Converting 1667 features to boolean dtype as they only contain 2 unique values.
(_dystack pid=32908) Stage 2 Generators:
(_dystack pid=32908) Fitting FillNaFeatureGenerator...
(_dystack pid=32908) Stage 3 Generators:
(_dystack pid=32908) Fitting IdentityFeatureGenerator...
(_dystack pid=32908) Stage 4 Generators:
(_dystack pid=32908) Fitting DropUniqueFeatureGenerator...
(_dystack pid=32908) Stage 5 Generators:
(_dystack pid=32908) Fitting DropDuplicatesFeatureGenerator...
(_dystack pid=32908) Useless Original Features (Count: 573): ['diag_1_110', 'diag_1_143', 'diag_1_146', 'diag_1_147', 'diag_1_149', 'diag_1_160', 'diag_1_170', 'diag_1_173', 'diag_1_175', 'diag_1_187', 'diag_1_192', 'diag_1_194', 'diag_1_195', 'diag_1_200', 'diag_1_201', 'diag_1_207', 'diag_1_208', 'diag_1_210', 'diag_1_212', 'diag_1_216', 'diag_1_219', 'diag_1_228', 'diag_1_236', 'diag_1_246', 'diag_1_25021', 'diag_1_25032', 'diag_1_2505', 'diag_1_25053', 'diag_1_2509', 'diag_1_25091', 'diag_1_262', 'diag_1_266', 'diag_1_272', 'diag_1_301', 'diag_1_308', 'diag_1_318', 'diag_1_320', 'diag_1_322', 'diag_1_334', 'diag_1_335', 'diag_1_336', 'diag_1_337', 'diag_1_353', 'diag_1_36', 'diag_1_362', 'diag_1_363', 'diag_1_374', 'diag_1_377', 'diag_1_378', 'diag_1_379', 'diag_1_381', 'diag_1_382', 'diag_1_383', 'diag_1_385', 'diag_1_388', 'diag_1_39', 'diag_1_397', 'diag_1_412', 'diag_1_422', 'diag_1_445', 'diag_1_448', 'diag_1_454', 'diag_1_457', 'diag_1_470', 'diag_1_477', 'diag_1_48', 'diag_1_49', 'diag_1_495', 'diag_1_501', 'diag_1_52', 'diag_1_521', 'diag_1_524', 'diag_1_526', 'diag_1_527', 'diag_1_528', 'diag_1_529', 'diag_1_542', 'diag_1_551', 'diag_1_565', 'diag_1_57', 'diag_1_573', 'diag_1_580', 'diag_1_582', 'diag_1_588', 'diag_1_603', 'diag_1_605', 'diag_1_61', 'diag_1_610', 'diag_1_616', 'diag_1_632', 'diag_1_634', 'diag_1_643', 'diag_1_645', 'diag_1_646', 'diag_1_649', 'diag_1_653', 'diag_1_657', 'diag_1_66', 'diag_1_665', 'diag_1_669', 'diag_1_671', 'diag_1_683', 'diag_1_684', 'diag_1_685', 'diag_1_690', 'diag_1_691', 'diag_1_692', 'diag_1_693', 'diag_1_695', 'diag_1_7', 'diag_1_706', 'diag_1_709', 'diag_1_732', 'diag_1_734', 'diag_1_735', 'diag_1_745', 'diag_1_746', 'diag_1_747', 'diag_1_75', 'diag_1_791', 'diag_1_792', 'diag_1_795', 'diag_1_800', 'diag_1_810', 'diag_1_817', 'diag_1_831', 'diag_1_832', 'diag_1_835', 'diag_1_836', 'diag_1_84', 'diag_1_840', 'diag_1_845', 'diag_1_846', 'diag_1_848', 'diag_1_854', 'diag_1_862', 'diag_1_863', 'diag_1_868', 
'diag_1_870', 'diag_1_871', 'diag_1_875', 'diag_1_879', 'diag_1_885', 'diag_1_886', 'diag_1_891', 'diag_1_892', 'diag_1_893', 'diag_1_897', 'diag_1_903', 'diag_1_906', 'diag_1_911', 'diag_1_914', 'diag_1_916', 'diag_1_921', 'diag_1_934', 'diag_1_936', 'diag_1_941', 'diag_1_955', 'diag_1_957', 'diag_1_963', 'diag_1_968', 'diag_1_976', 'diag_1_977', 'diag_1_98', 'diag_1_982', 'diag_1_983', 'diag_1_987', 'diag_1_994', 'diag_1_E909', 'diag_1_V45', 'diag_1_V51', 'diag_1_V55', 'diag_1_V56', 'diag_1_V70', 'diag_2_111', 'diag_2_114', 'diag_2_123', 'diag_2_131', 'diag_2_137', 'diag_2_145', 'diag_2_152', 'diag_2_163', 'diag_2_164', 'diag_2_173', 'diag_2_180', 'diag_2_188', 'diag_2_192', 'diag_2_212', 'diag_2_217', 'diag_2_225', 'diag_2_228', 'diag_2_235', 'diag_2_239', 'diag_2_240', 'diag_2_25023', 'diag_2_259', 'diag_2_268', 'diag_2_269', 'diag_2_27', 'diag_2_271', 'diag_2_281', 'diag_2_297', 'diag_2_306', 'diag_2_308', 'diag_2_31', 'diag_2_310', 'diag_2_316', 'diag_2_317', 'diag_2_324', 'diag_2_333', 'diag_2_338', 'diag_2_34', 'diag_2_347', 'diag_2_35', 'diag_2_351', 'diag_2_352', 'diag_2_353', 'diag_2_354', 'diag_2_355', 'diag_2_369', 'diag_2_373', 'diag_2_374', 'diag_2_376', 'diag_2_379', 'diag_2_382', 'diag_2_383', 'diag_2_389', 'diag_2_394', 'diag_2_395', 'diag_2_40', 'diag_2_422', 'diag_2_430', 'diag_2_442', 'diag_2_464', 'diag_2_470', 'diag_2_484', 'diag_2_501', 'diag_2_514', 'diag_2_520', 'diag_2_521', 'diag_2_522', 'diag_2_523', 'diag_2_527', 'diag_2_529', 'diag_2_534', 'diag_2_550', 'diag_2_557', 'diag_2_570', 'diag_2_579', 'diag_2_580', 'diag_2_588', 'diag_2_598', 'diag_2_604', 'diag_2_605', 'diag_2_615', 'diag_2_619', 'diag_2_622', 'diag_2_623', 'diag_2_627', 'diag_2_634', 'diag_2_641', 'diag_2_644', 'diag_2_645', 'diag_2_652', 'diag_2_658', 'diag_2_659', 'diag_2_661', 'diag_2_663', 'diag_2_664', 'diag_2_665', 'diag_2_674', 'diag_2_683', 'diag_2_686', 'diag_2_696', 'diag_2_702', 'diag_2_703', 'diag_2_704', 'diag_2_705', 'diag_2_706', 'diag_2_709', 'diag_2_713', 
'diag_2_725', 'diag_2_734', 'diag_2_741', 'diag_2_742', 'diag_2_747', 'diag_2_748', 'diag_2_751', 'diag_2_753', 'diag_2_756', 'diag_2_758', 'diag_2_759', 'diag_2_795', 'diag_2_797', 'diag_2_800', 'diag_2_801', 'diag_2_814', 'diag_2_815', 'diag_2_831', 'diag_2_832', 'diag_2_837', 'diag_2_843', 'diag_2_847', 'diag_2_861', 'diag_2_863', 'diag_2_867', 'diag_2_868', 'diag_2_870', 'diag_2_872', 'diag_2_880', 'diag_2_906', 'diag_2_910', 'diag_2_913', 'diag_2_915', 'diag_2_918', 'diag_2_922', 'diag_2_927', 'diag_2_933', 'diag_2_942', 'diag_2_944', 'diag_2_947', 'diag_2_952', 'diag_2_953', 'diag_2_962', 'diag_2_965', 'diag_2_967', 'diag_2_969', 'diag_2_99', 'diag_2_991', 'diag_2_992', 'diag_2_E816', 'diag_2_E819', 'diag_2_E821', 'diag_2_E850', 'diag_2_E868', 'diag_2_E882', 'diag_2_E883', 'diag_2_E887', 'diag_2_E890', 'diag_2_E905', 'diag_2_E906', 'diag_2_E916', 'diag_2_E924', 'diag_2_E928', 'diag_2_E930', 'diag_2_E931', 'diag_2_E936', 'diag_2_E938', 'diag_2_E944', 'diag_2_E965', 'diag_2_E968', 'diag_2_E980', 'diag_2_V11', 'diag_2_V13', 'diag_2_V16', 'diag_2_V23', 'diag_2_V25', 'diag_2_V46', 'diag_2_V50', 'diag_2_V53', 'diag_2_V57', 'diag_2_V60', 'diag_2_V61', 'diag_2_V86', 'diag_3_115', 'diag_3_117', 'diag_3_122', 'diag_3_123', 'diag_3_136', 'diag_3_139', 'diag_3_14', 'diag_3_141', 'diag_3_150', 'diag_3_152', 'diag_3_161', 'diag_3_163', 'diag_3_17', 'diag_3_170', 'diag_3_171', 'diag_3_172', 'diag_3_182', 'diag_3_193', 'diag_3_214', 'diag_3_217', 'diag_3_223', 'diag_3_227', 'diag_3_233', 'diag_3_235', 'diag_3_240', 'diag_3_245', 'diag_3_25022', 'diag_3_2503', 'diag_3_25031', 'diag_3_258', 'diag_3_259', 'diag_3_265', 'diag_3_270', 'diag_3_271', 'diag_3_299', 'diag_3_314', 'diag_3_315', 'diag_3_318', 'diag_3_323', 'diag_3_35', 'diag_3_350', 'diag_3_351', 'diag_3_359', 'diag_3_360', 'diag_3_361', 'diag_3_36544', 'diag_3_366', 'diag_3_370', 'diag_3_373', 'diag_3_374', 'diag_3_376', 'diag_3_379', 'diag_3_380', 'diag_3_381', 'diag_3_382', 'diag_3_383', 'diag_3_385', 'diag_3_387', 
'diag_3_388', 'diag_3_395', 'diag_3_405', 'diag_3_417', 'diag_3_421', 'diag_3_430', 'diag_3_432', 'diag_3_436', 'diag_3_460', 'diag_3_463', 'diag_3_464', 'diag_3_47', 'diag_3_470', 'diag_3_472', 'diag_3_480', 'diag_3_49', 'diag_3_5', 'diag_3_500', 'diag_3_501', 'diag_3_506', 'diag_3_508', 'diag_3_514', 'diag_3_516', 'diag_3_522', 'diag_3_523', 'diag_3_524', 'diag_3_525', 'diag_3_527', 'diag_3_534', 'diag_3_540', 'diag_3_542', 'diag_3_543', 'diag_3_556', 'diag_3_565', 'diag_3_566', 'diag_3_579', 'diag_3_597', 'diag_3_598', 'diag_3_602', 'diag_3_608', 'diag_3_624', 'diag_3_641', 'diag_3_647', 'diag_3_655', 'diag_3_657', 'diag_3_665', 'diag_3_669', 'diag_3_674', 'diag_3_685', 'diag_3_690', 'diag_3_694', 'diag_3_702', 'diag_3_706', 'diag_3_712', 'diag_3_734', 'diag_3_735', 'diag_3_741', 'diag_3_744', 'diag_3_746', 'diag_3_747', 'diag_3_750', 'diag_3_754', 'diag_3_755', 'diag_3_758', 'diag_3_759', 'diag_3_78', 'diag_3_801', 'diag_3_810', 'diag_3_814', 'diag_3_822', 'diag_3_826', 'diag_3_831', 'diag_3_834', 'diag_3_844', 'diag_3_845', 'diag_3_847', 'diag_3_851', 'diag_3_853', 'diag_3_854', 'diag_3_862', 'diag_3_864', 'diag_3_866', 'diag_3_868', 'diag_3_871', 'diag_3_872', 'diag_3_876', 'diag_3_877', 'diag_3_881', 'diag_3_883', 'diag_3_890', 'diag_3_906', 'diag_3_910', 'diag_3_911', 'diag_3_912', 'diag_3_918', 'diag_3_919', 'diag_3_921', 'diag_3_930', 'diag_3_933', 'diag_3_943', 'diag_3_945', 'diag_3_951', 'diag_3_952', 'diag_3_953', 'diag_3_955', 'diag_3_956', 'diag_3_962', 'diag_3_965', 'diag_3_966', 'diag_3_967', 'diag_3_969', 'diag_3_970', 'diag_3_972', 'diag_3_987', 'diag_3_989', 'diag_3_E815', 'diag_3_E817', 'diag_3_E818', 'diag_3_E819', 'diag_3_E822', 'diag_3_E825', 'diag_3_E826', 'diag_3_E828', 'diag_3_E850', 'diag_3_E852', 'diag_3_E853', 'diag_3_E858', 'diag_3_E861', 'diag_3_E882', 'diag_3_E883', 'diag_3_E887', 'diag_3_E905', 'diag_3_E906', 'diag_3_E916', 'diag_3_E917', 'diag_3_E924', 'diag_3_E930', 'diag_3_E941', 'diag_3_E943', 'diag_3_E949', 'diag_3_E966', 
'diag_3_E980', 'diag_3_V02', 'diag_3_V03', 'diag_3_V13', 'diag_3_V16', 'diag_3_V23', 'diag_3_V57', 'diag_3_V61', 'diag_3_V66', 'diag_3_V86']
(_dystack pid=32908) These features carry no predictive signal and should be manually investigated.
(_dystack pid=32908) This is typically a feature which has the same value for all rows.
(_dystack pid=32908) These features do not need to be present at inference time.
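The log message above describes features that take the same value in every row. A minimal sketch, on hypothetical toy data, of how such constant columns could be found and dropped in pandas before training is given below. It is not AutoGluon's internal check, and the column names `code_A` and `code_B` are placeholders for one-hot encoded codes.

```python
import pandas as pd

# Hypothetical toy frame: two one-hot columns, one of which never fires in
# this sample and is therefore constant, plus one numeric feature.
df_toy = pd.DataFrame({
    "code_A": [1, 0, 1, 0],
    "code_B": [0, 0, 0, 0],   # constant: the code is absent from the sample
    "age": [55, 63, 47, 71],
})

# A column is constant if it holds at most one distinct value.
n_unique = df_toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop the constant columns; they cannot separate the classes.
df_reduced = df_toy.drop(columns=constant_cols)

print(constant_cols)
print(df_reduced.columns.tolist())
```

A column can be constant in a training subset and still vary in the full data, so a check like this is best run on the same rows that are passed to the model.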
(_dystack pid=32908) Unused Original Features (Count: 610): ['payer_code_MP', 'medical_specialty_ObstericsGynecologyGynecologicOnco', 'medical_specialty_Obstetrics', 'medical_specialty_Osteopath', 'medical_specialty_Podiatry', 'medical_specialty_Rheumatology', 'medical_specialty_SurgeryPlastic', 'medical_specialty_SurgicalSpecialty', 'diag_1_11', 'diag_1_115', 'diag_1_117', 'diag_1_133', 'diag_1_135', 'diag_1_136', 'diag_1_142', 'diag_1_155', 'diag_1_161', 'diag_1_163', 'diag_1_164', 'diag_1_171', 'diag_1_172', 'diag_1_179', 'diag_1_180', 'diag_1_184', 'diag_1_199', 'diag_1_203', 'diag_1_205', 'diag_1_214', 'diag_1_215', 'diag_1_229', 'diag_1_230', 'diag_1_233', 'diag_1_237', 'diag_1_239', 'diag_1_240', 'diag_1_244', 'diag_1_245', 'diag_1_2503', 'diag_1_25031', 'diag_1_25033', 'diag_1_25043', 'diag_1_25051', 'diag_1_25093', 'diag_1_251', 'diag_1_252', 'diag_1_253', 'diag_1_255', 'diag_1_263', 'diag_1_273', 'diag_1_277', 'diag_1_282', 'diag_1_283', 'diag_1_284', 'diag_1_288', 'diag_1_293', 'diag_1_299', 'diag_1_3', 'diag_1_305', 'diag_1_306', 'diag_1_307', 'diag_1_31', 'diag_1_314', 'diag_1_323', 'diag_1_325', 'diag_1_338', 'diag_1_34', 'diag_1_347', 'diag_1_350', 'diag_1_352', 'diag_1_354', 'diag_1_355', 'diag_1_356', 'diag_1_357', 'diag_1_360', 'diag_1_361', 'diag_1_368', 'diag_1_370', 'diag_1_376', 'diag_1_380', 'diag_1_384', 'diag_1_391', 'diag_1_394', 'diag_1_395', 'diag_1_405', 'diag_1_41', 'diag_1_416', 'diag_1_417', 'diag_1_423', 'diag_1_429', 'diag_1_430', 'diag_1_442', 'diag_1_452', 'diag_1_456', 'diag_1_462', 'diag_1_463', 'diag_1_464', 'diag_1_474', 'diag_1_475', 'diag_1_480', 'diag_1_483', 'diag_1_487', 'diag_1_492', 'diag_1_5', 'diag_1_508', 'diag_1_510', 'diag_1_512', 'diag_1_513', 'diag_1_514', 'diag_1_522', 'diag_1_54', 'diag_1_541', 'diag_1_567', 'diag_1_570', 'diag_1_579', 'diag_1_581', 'diag_1_586', 'diag_1_591', 'diag_1_594', 'diag_1_602', 'diag_1_608', 'diag_1_611', 'diag_1_614', 'diag_1_617', 'diag_1_619', 'diag_1_620', 'diag_1_622', 
'diag_1_623', 'diag_1_633', 'diag_1_641', 'diag_1_647', 'diag_1_652', 'diag_1_655', 'diag_1_656', 'diag_1_658', 'diag_1_660', 'diag_1_664', 'diag_1_674', 'diag_1_686', 'diag_1_705', 'diag_1_708', 'diag_1_710', 'diag_1_717', 'diag_1_718', 'diag_1_725', 'diag_1_736', 'diag_1_751', 'diag_1_753', 'diag_1_759', 'diag_1_78', 'diag_1_797', 'diag_1_801', 'diag_1_802', 'diag_1_814', 'diag_1_815', 'diag_1_816', 'diag_1_82', 'diag_1_826', 'diag_1_833', 'diag_1_834', 'diag_1_847', 'diag_1_853', 'diag_1_861', 'diag_1_865', 'diag_1_866', 'diag_1_867', 'diag_1_873', 'diag_1_88', 'diag_1_880', 'diag_1_883', 'diag_1_915', 'diag_1_920', 'diag_1_933', 'diag_1_935', 'diag_1_939', 'diag_1_94', 'diag_1_942', 'diag_1_945', 'diag_1_952', 'diag_1_958', 'diag_1_959', 'diag_1_964', 'diag_1_965', 'diag_1_967', 'diag_1_970', 'diag_1_991', 'diag_1_992', 'diag_1_999', 'diag_1_V26', 'diag_1_V60', 'diag_1_V63', 'diag_1_V71', 'diag_2_110', 'diag_2_117', 'diag_2_136', 'diag_2_138', 'diag_2_140', 'diag_2_156', 'diag_2_172', 'diag_2_174', 'diag_2_179', 'diag_2_183', 'diag_2_185', 'diag_2_186', 'diag_2_191', 'diag_2_193', 'diag_2_200', 'diag_2_201', 'diag_2_203', 'diag_2_208', 'diag_2_215', 'diag_2_218', 'diag_2_220', 'diag_2_226', 'diag_2_233', 'diag_2_238', 'diag_2_241', 'diag_2_242', 'diag_2_246', 'diag_2_25011', 'diag_2_25022', 'diag_2_25033', 'diag_2_25053', 'diag_2_2509', 'diag_2_25091', 'diag_2_25093', 'diag_2_251', 'diag_2_253', 'diag_2_258', 'diag_2_260', 'diag_2_261', 'diag_2_262', 'diag_2_273', 'diag_2_274', 'diag_2_275', 'diag_2_277', 'diag_2_279', 'diag_2_282', 'diag_2_283', 'diag_2_298', 'diag_2_299', 'diag_2_301', 'diag_2_302', 'diag_2_307', 'diag_2_311', 'diag_2_314', 'diag_2_320', 'diag_2_322', 'diag_2_323', 'diag_2_327', 'diag_2_335', 'diag_2_336', 'diag_2_343', 'diag_2_345', 'diag_2_356', 'diag_2_358', 'diag_2_362', 'diag_2_365', 'diag_2_372', 'diag_2_377', 'diag_2_378', 'diag_2_380', 'diag_2_381', 'diag_2_398', 'diag_2_405', 'diag_2_421', 'diag_2_429', 'diag_2_431', 'diag_2_432', 
'diag_2_446', 'diag_2_447', 'diag_2_451', 'diag_2_454', 'diag_2_455', 'diag_2_456', 'diag_2_457', 'diag_2_46', 'diag_2_460', 'diag_2_461', 'diag_2_462', 'diag_2_472', 'diag_2_474', 'diag_2_477', 'diag_2_478', 'diag_2_480', 'diag_2_485', 'diag_2_490', 'diag_2_494', 'diag_2_495', 'diag_2_500', 'diag_2_510', 'diag_2_512', 'diag_2_516', 'diag_2_517', 'diag_2_519', 'diag_2_524', 'diag_2_528', 'diag_2_53', 'diag_2_533', 'diag_2_54', 'diag_2_542', 'diag_2_543', 'diag_2_565', 'diag_2_566', 'diag_2_572', 'diag_2_573', 'diag_2_583', 'diag_2_586', 'diag_2_594', 'diag_2_596', 'diag_2_601', 'diag_2_602', 'diag_2_607', 'diag_2_608', 'diag_2_620', 'diag_2_621', 'diag_2_626', 'diag_2_646', 'diag_2_647', 'diag_2_649', 'diag_2_654', 'diag_2_656', 'diag_2_670', 'diag_2_691', 'diag_2_692', 'diag_2_693', 'diag_2_698', 'diag_2_712', 'diag_2_714', 'diag_2_716', 'diag_2_718', 'diag_2_719', 'diag_2_737', 'diag_2_752', 'diag_2_755', 'diag_2_783', 'diag_2_79', 'diag_2_791', 'diag_2_792', 'diag_2_802', 'diag_2_808', 'diag_2_810', 'diag_2_812', 'diag_2_813', 'diag_2_816', 'diag_2_821', 'diag_2_822', 'diag_2_825', 'diag_2_833', 'diag_2_836', 'diag_2_840', 'diag_2_842', 'diag_2_844', 'diag_2_851', 'diag_2_864', 'diag_2_865', 'diag_2_866', 'diag_2_869', 'diag_2_871', 'diag_2_88', 'diag_2_883', 'diag_2_892', 'diag_2_907', 'diag_2_909', 'diag_2_911', 'diag_2_916', 'diag_2_919', 'diag_2_921', 'diag_2_923', 'diag_2_934', 'diag_2_948', 'diag_2_955', 'diag_2_958', 'diag_2_972', 'diag_2_980', 'diag_2_989', 'diag_2_999', 'diag_2_E826', 'diag_2_E870', 'diag_2_E879', 'diag_2_E881', 'diag_2_E884', 'diag_2_E917', 'diag_2_E918', 'diag_2_E919', 'diag_2_E927', 'diag_2_E933', 'diag_2_E934', 'diag_2_E935', 'diag_2_E942', 'diag_2_V02', 'diag_2_V08', 'diag_2_V14', 'diag_2_V44', 'diag_2_V49', 'diag_2_V63', 'diag_2_V65', 'diag_3_155', 'diag_3_156', 'diag_3_157', 'diag_3_158', 'diag_3_173', 'diag_3_179', 'diag_3_180', 'diag_3_183', 'diag_3_188', 'diag_3_191', 'diag_3_192', 'diag_3_199', 'diag_3_201', 'diag_3_202', 
'diag_3_208', 'diag_3_220', 'diag_3_225', 'diag_3_228', 'diag_3_241', 'diag_3_25013', 'diag_3_2502', 'diag_3_25043', 'diag_3_2507', 'diag_3_25083', 'diag_3_2509', 'diag_3_25091', 'diag_3_25093', 'diag_3_251', 'diag_3_252', 'diag_3_256', 'diag_3_262', 'diag_3_277', 'diag_3_279', 'diag_3_281', 'diag_3_283', 'diag_3_289', 'diag_3_297', 'diag_3_298', 'diag_3_306', 'diag_3_313', 'diag_3_317', 'diag_3_332', 'diag_3_334', 'diag_3_335', 'diag_3_336', 'diag_3_343', 'diag_3_347', 'diag_3_353', 'diag_3_354', 'diag_3_355', 'diag_3_356', 'diag_3_365', 'diag_3_368', 'diag_3_369', 'diag_3_372', 'diag_3_377', 'diag_3_384', 'diag_3_389', 'diag_3_391', 'diag_3_394', 'diag_3_420', 'diag_3_423', 'diag_3_431', 'diag_3_442', 'diag_3_444', 'diag_3_446', 'diag_3_451', 'diag_3_457', 'diag_3_465', 'diag_3_477', 'diag_3_494', 'diag_3_495', 'diag_3_510', 'diag_3_519', 'diag_3_528', 'diag_3_529', 'diag_3_53', 'diag_3_531', 'diag_3_532', 'diag_3_533', 'diag_3_537', 'diag_3_54', 'diag_3_550', 'diag_3_555', 'diag_3_568', 'diag_3_57', 'diag_3_572', 'diag_3_573', 'diag_3_575', 'diag_3_576', 'diag_3_582', 'diag_3_586', 'diag_3_592', 'diag_3_594', 'diag_3_604', 'diag_3_605', 'diag_3_607', 'diag_3_611', 'diag_3_619', 'diag_3_620', 'diag_3_621', 'diag_3_623', 'diag_3_626', 'diag_3_627', 'diag_3_644', 'diag_3_646', 'diag_3_649', 'diag_3_653', 'diag_3_654', 'diag_3_656', 'diag_3_660', 'diag_3_661', 'diag_3_663', 'diag_3_664', 'diag_3_670', 'diag_3_692', 'diag_3_693', 'diag_3_695', 'diag_3_696', 'diag_3_697', 'diag_3_701', 'diag_3_703', 'diag_3_708', 'diag_3_709', 'diag_3_717', 'diag_3_718', 'diag_3_720', 'diag_3_722', 'diag_3_725', 'diag_3_726', 'diag_3_745', 'diag_3_751', 'diag_3_752', 'diag_3_753', 'diag_3_756', 'diag_3_791', 'diag_3_792', 'diag_3_793', 'diag_3_796', 'diag_3_797', 'diag_3_800', 'diag_3_807', 'diag_3_808', 'diag_3_815', 'diag_3_816', 'diag_3_823', 'diag_3_836', 'diag_3_837', 'diag_3_838', 'diag_3_840', 'diag_3_841', 'diag_3_842', 'diag_3_850', 'diag_3_861', 'diag_3_865', 'diag_3_867', 
'diag_3_875', 'diag_3_880', 'diag_3_882', 'diag_3_891', 'diag_3_892', 'diag_3_907', 'diag_3_916', 'diag_3_917', 'diag_3_922', 'diag_3_924', 'diag_3_934', 'diag_3_94', 'diag_3_948', 'diag_3_959', 'diag_3_971', 'diag_3_E812', 'diag_3_E816', 'diag_3_E870', 'diag_3_E884', 'diag_3_E892', 'diag_3_E894', 'diag_3_E900', 'diag_3_E912', 'diag_3_E919', 'diag_3_E920', 'diag_3_E922', 'diag_3_E927', 'diag_3_E928', 'diag_3_E929', 'diag_3_E931', 'diag_3_E933', 'diag_3_E934', 'diag_3_E935', 'diag_3_E938', 'diag_3_E944', 'diag_3_E946', 'diag_3_E950', 'diag_3_E956', 'diag_3_V06', 'diag_3_V09', 'diag_3_V11', 'diag_3_V44', 'diag_3_V46', 'diag_3_V53', 'diag_3_V55', 'diag_3_V62', 'diag_3_V63', 'diag_3_V65', 'diag_3_V70', 'nateglinide_Up', 'chlorpropamide_Steady', 'tolbutamide_Steady', 'acarbose_Up', 'miglitol_Steady', 'tolazamide_Steady', 'group_diag_1_na', 'group_diag_2_na', 'group_diag_3_na']
(_dystack pid=32908) These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
(_dystack pid=32908) Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
(_dystack pid=32908) These features do not need to be present at inference time.
(_dystack pid=32908) ('float', []) : 610 | ['payer_code_MP', 'medical_specialty_ObstericsGynecologyGynecologicOnco', 'medical_specialty_Obstetrics', 'medical_specialty_Osteopath', 'medical_specialty_Podiatry', ...]
(_dystack pid=32908) Types of features in original data (raw dtype, special dtypes):
(_dystack pid=32908) ('float', []) : 1065 | ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', ...]
(_dystack pid=32908) Types of features in processed data (raw dtype, special dtypes):
(_dystack pid=32908) ('float', []) : 8 | ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', ...]
(_dystack pid=32908) ('int', ['bool']) : 1057 | ['race_AfricanAmerican', 'race_Asian', 'race_Caucasian', 'race_Hispanic', 'race_Other', ...]
(_dystack pid=32908) 11.9s = Fit runtime
(_dystack pid=32908) 1065 features in original data used to generate 1065 features in processed data.
(_dystack pid=32908) Train Data (Processed) Memory Usage: 12.52 MB (0.5% of available memory)
(_dystack pid=32908) Data preprocessing and feature engineering runtime = 12.18s ...
(_dystack pid=32908) AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'
(_dystack pid=32908) This metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()
(_dystack pid=32908) To change this, specify the eval_metric parameter of Predictor()
(_dystack pid=32908) Large model count detected (112 configs) ... Only displaying the first 3 models of each family. To see all, set `verbosity=3`.
(_dystack pid=32908) User-specified model hyperparameters to be fit:
(_dystack pid=32908) {
(_dystack pid=32908) 'NN_TORCH': [{}, {'activation': 'elu', 'dropout_prob': 0.10077639529843717, 'hidden_size': 108, 'learning_rate': 0.002735937344002146, 'num_layers': 4, 'use_batchnorm': True, 'weight_decay': 1.356433327634438e-12, 'ag_args': {'name_suffix': '_r79', 'priority': -2}}, {'activation': 'elu', 'dropout_prob': 0.11897478034205347, 'hidden_size': 213, 'learning_rate': 0.0010474382260641949, 'num_layers': 4, 'use_batchnorm': False, 'weight_decay': 5.594471067786272e-10, 'ag_args': {'name_suffix': '_r22', 'priority': -7}}],
(_dystack pid=32908) 'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
(_dystack pid=32908) 'CAT': [{}, {'depth': 6, 'grow_policy': 'SymmetricTree', 'l2_leaf_reg': 2.1542798306067823, 'learning_rate': 0.06864209415792857, 'max_ctr_complexity': 4, 'one_hot_max_size': 10, 'ag_args': {'name_suffix': '_r177', 'priority': -1}}, {'depth': 8, 'grow_policy': 'Depthwise', 'l2_leaf_reg': 2.7997999596449104, 'learning_rate': 0.031375015734637225, 'max_ctr_complexity': 2, 'one_hot_max_size': 3, 'ag_args': {'name_suffix': '_r9', 'priority': -5}}],
(_dystack pid=32908) 'XGB': [{}, {'colsample_bytree': 0.6917311125174739, 'enable_categorical': False, 'learning_rate': 0.018063876087523967, 'max_depth': 10, 'min_child_weight': 0.6028633586934382, 'ag_args': {'name_suffix': '_r33', 'priority': -8}}, {'colsample_bytree': 0.6628423832084077, 'enable_categorical': False, 'learning_rate': 0.08775715546881824, 'max_depth': 5, 'min_child_weight': 0.6294123374222513, 'ag_args': {'name_suffix': '_r89', 'priority': -16}}],
(_dystack pid=32908) 'FASTAI': [{}, {'bs': 256, 'emb_drop': 0.5411770367537934, 'epochs': 43, 'layers': [800, 400], 'lr': 0.01519848858318159, 'ps': 0.23782946566604385, 'ag_args': {'name_suffix': '_r191', 'priority': -4}}, {'bs': 2048, 'emb_drop': 0.05070411322605811, 'epochs': 29, 'layers': [200, 100], 'lr': 0.08974235041576624, 'ps': 0.10393466140748028, 'ag_args': {'name_suffix': '_r102', 'priority': -11}}],
(_dystack pid=32908) 'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
(_dystack pid=32908) 'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
(_dystack pid=32908) 'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
(_dystack pid=32908) }
(_dystack pid=32908) AutoGluon will fit 3 stack levels (L1 to L3) ...
(_dystack pid=32908) Fitting 110 L1 models, fit_strategy="sequential" ...
(_dystack pid=32908) Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 390.73s of the 879.36s of remaining time.
(_dystack pid=32908) 0.5801 = Validation score (roc_auc)
(_dystack pid=32908) 0.08s = Training runtime
(_dystack pid=32908) 0.55s = Validation runtime
(_dystack pid=32908) Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 386.38s of the 875.01s of remaining time.
(_dystack pid=32908) 0.58 = Validation score (roc_auc)
(_dystack pid=32908) 0.08s = Training runtime
(_dystack pid=32908) 0.61s = Validation runtime
(_dystack pid=32908) Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 385.50s of the 874.13s of remaining time.
(_dystack pid=32908) Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 13.24% memory usage per fold, 52.94%/80.00% total).
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=13.24%)
(_dystack pid=32908) 0.6936 = Validation score (roc_auc)
(_dystack pid=32908) 10.9s = Training runtime
(_dystack pid=32908) 0.33s = Validation runtime
(_dystack pid=32908) Fitting model: LightGBM_BAG_L1 ... Training model for up to 369.92s of the 858.54s of remaining time.
(_dystack pid=32908) Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 11.69% memory usage per fold, 46.77%/80.00% total).
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=11.69%)
(_dystack pid=32908) 0.6921 = Validation score (roc_auc)
(_dystack pid=32908) 18.42s = Training runtime
(_dystack pid=32908) 0.52s = Validation runtime
(_dystack pid=32908) Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 346.88s of the 835.50s of remaining time.
(_dystack pid=32908) 0.6826 = Validation score (roc_auc)
(_dystack pid=32908) 10.74s = Training runtime
(_dystack pid=32908) 6.92s = Validation runtime
(_dystack pid=32908) Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 327.93s of the 816.56s of remaining time.
(_dystack pid=32908) 0.6837 = Validation score (roc_auc)
(_dystack pid=32908) 9.97s = Training runtime
(_dystack pid=32908) 6.6s = Validation runtime
(_dystack pid=32908) Fitting model: CatBoost_BAG_L1 ... Training model for up to 310.30s of the 798.92s of remaining time.
(_dystack pid=32908) Warning: Potentially not enough memory to safely train model. Estimated to require 1.483 GB out of 1.744 GB available memory (85.036%)... (100.000% of avail memory is the max safe size)
(_dystack pid=32908) To avoid this warning, specify the model hyperparameter "ag.max_memory_usage_ratio" to a larger value (currently 1.0, set to >=1.18 to avoid the warning)
(_dystack pid=32908) To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
(_dystack pid=32908) Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
(_dystack pid=32908) Memory not enough to fit 8 folds in parallel. Will train 1 folds in parallel instead (Estimated 85.04% memory usage per fold, 85.04%/80.00% total).
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (1 workers, per: cpus=6, gpus=0, memory=85.04%)
(_dystack pid=32908) Switching to pseudo sequential ParallelFoldFittingStrategy to avoid Python memory leakage.
(_dystack pid=32908) Overrule this behavior by setting fold_fitting_strategy to 'sequential_local' in ag_args_ensemble when when calling `predictor.fit`
(_ray_fit pid=16808) Warning: Potentially not enough memory to safely train model. Estimated to require 1.461 GB out of 1.515 GB available memory (96.416%)... (100.000% of avail memory is the max safe size)
(_ray_fit pid=16808) To avoid this warning, specify the model hyperparameter "ag.max_memory_usage_ratio" to a larger value (currently 1.0, set to >=1.34 to avoid the warning)
(_ray_fit pid=16808) To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
(_ray_fit pid=16808) Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
(_ray_fit pid=1152) Warning: Potentially not enough memory to safely train model. Estimated to require 1.461 GB out of 1.901 GB available memory (76.844%)... (100.000% of avail memory is the max safe size)
(_ray_fit pid=1152) To avoid this warning, specify the model hyperparameter "ag.max_memory_usage_ratio" to a larger value (currently 1.0, set to >=1.07 to avoid the warning)
(_ray_fit pid=1152) To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
(_ray_fit pid=1152) Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
(_ray_fit pid=22484) Warning: Potentially not enough memory to safely train model. Estimated to require 1.461 GB out of 1.702 GB available memory (85.855%)... (100.000% of avail memory is the max safe size)
(_ray_fit pid=22484) To avoid this warning, specify the model hyperparameter "ag.max_memory_usage_ratio" to a larger value (currently 1.0, set to >=1.19 to avoid the warning)
(_ray_fit pid=22484) To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
(_ray_fit pid=22484) Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
(_ray_fit pid=27396) Warning: Potentially not enough memory to safely train model. Estimated to require 1.461 GB out of 1.553 GB available memory (94.088%)... (100.000% of avail memory is the max safe size)
(_ray_fit pid=27396) To avoid this warning, specify the model hyperparameter "ag.max_memory_usage_ratio" to a larger value (currently 1.0, set to >=1.30 to avoid the warning)
(_ray_fit pid=27396) To set the same value for all models, do the following when calling predictor.fit: `predictor.fit(..., ag_args_fit={"ag.max_memory_usage_ratio": VALUE})`
(_ray_fit pid=27396) Setting "ag.max_memory_usage_ratio" to values above 1 may result in out-of-memory errors. You may consider using a machine with more memory as a safer alternative.
(_ray_fit pid=27396) Iteration 6: Low available memory (<512 MB). Available memory: 489.45 MB, Estimated model size: 12.73 MB.
(_ray_fit pid=27396) Early stopping model prior to optimal result to avoid OOM error. Please increase available memory to avoid subpar model quality.
(_ray_fit pid=27396) Ran low on memory, early stopping on iteration 6.
(_dystack pid=32908) 0.6814 = Validation score (roc_auc)
(_dystack pid=32908) 138.48s = Training runtime
(_dystack pid=32908) 1.48s = Validation runtime
(_dystack pid=32908) Fitting model: ExtraTreesGini_BAG_L1 ... Training model for up to 165.63s of the 654.25s of remaining time.
(_dystack pid=32908) 0.6834 = Validation score (roc_auc)
(_dystack pid=32908) 11.24s = Training runtime
(_dystack pid=32908) 6.37s = Validation runtime
(_dystack pid=32908) Fitting model: ExtraTreesEntr_BAG_L1 ... Training model for up to 146.99s of the 635.61s of remaining time.
(_dystack pid=32908) 0.6774 = Validation score (roc_auc)
(_dystack pid=32908) 11.26s = Training runtime
(_dystack pid=32908) 7.4s = Validation runtime
(_dystack pid=32908) Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 127.45s of the 616.07s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=4.25%)
(_ray_fit pid=26056) No improvement since epoch 0: early stopping
(_dystack pid=32908) 0.6141 = Validation score (roc_auc)
(_dystack pid=32908) 68.74s = Training runtime
(_dystack pid=32908) 1.94s = Validation runtime
(_dystack pid=32908) Fitting model: XGBoost_BAG_L1 ... Training model for up to 50.68s of the 539.30s of remaining time.
(_dystack pid=32908) Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 10.06% memory usage per fold, 40.24%/80.00% total).
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=10.06%)
(_dystack pid=32908) 0.6877 = Validation score (roc_auc)
(_dystack pid=32908) 46.89s = Training runtime
(_dystack pid=32908) 0.74s = Validation runtime
(_ray_fit pid=27516) No improvement since epoch 0: early stopping [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(_dystack pid=32908) Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.00s of the 484.12s of remaining time.
(_dystack pid=32908) Ensemble Weights: {'ExtraTreesGini_BAG_L1': 0.25, 'LightGBMXT_BAG_L1': 0.208, 'LightGBM_BAG_L1': 0.167, 'CatBoost_BAG_L1': 0.167, 'RandomForestGini_BAG_L1': 0.083, 'KNeighborsDist_BAG_L1': 0.042, 'RandomForestEntr_BAG_L1': 0.042, 'XGBoost_BAG_L1': 0.042}
(_dystack pid=32908) 0.7013 = Validation score (roc_auc)
(_dystack pid=32908) 0.89s = Training runtime
(_dystack pid=32908) 0.0s = Validation runtime
(_dystack pid=32908) Fitting 108 L2 models, fit_strategy="sequential" ...
(_dystack pid=32908) Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 322.05s of the 482.86s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=8.61%)
(_dystack pid=32908) 0.6942 = Validation score (roc_auc)
(_dystack pid=32908) 6.3s = Training runtime
(_dystack pid=32908) 0.54s = Validation runtime
(_dystack pid=32908) Fitting model: LightGBM_BAG_L2 ... Training model for up to 305.69s of the 466.50s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=6.34%)
(_dystack pid=32908) 0.6928 = Validation score (roc_auc)
(_dystack pid=32908) 5.21s = Training runtime
(_dystack pid=32908) 0.48s = Validation runtime
(_dystack pid=32908) Fitting model: RandomForestGini_BAG_L2 ... Training model for up to 291.12s of the 451.93s of remaining time.
(_dystack pid=32908) 0.6872 = Validation score (roc_auc)
(_dystack pid=32908) 6.57s = Training runtime
(_dystack pid=32908) 6.35s = Validation runtime
(_dystack pid=32908) Fitting model: RandomForestEntr_BAG_L2 ... Training model for up to 277.06s of the 437.87s of remaining time.
(_dystack pid=32908) 0.6939 = Validation score (roc_auc)
(_dystack pid=32908) 7.45s = Training runtime
(_dystack pid=32908) 10.16s = Validation runtime
(_dystack pid=32908) Fitting model: CatBoost_BAG_L2 ... Training model for up to 258.50s of the 419.31s of remaining time.
(_dystack pid=32908) Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 32.49% memory usage per fold, 64.97%/80.00% total).
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=32.49%)
(_ray_fit pid=32476) Iteration 290: Low available memory (<512 MB). Available memory: 465.66 MB, Estimated model size: 17.73 MB.
(_ray_fit pid=32476) Early stopping model prior to optimal result to avoid OOM error. Please increase available memory to avoid subpar model quality.
(_ray_fit pid=32476) Ran low on memory, early stopping on iteration 290.
(_dystack pid=32908) 0.6982 = Validation score (roc_auc)
(_dystack pid=32908) 56.72s = Training runtime
(_dystack pid=32908) 1.61s = Validation runtime
(_dystack pid=32908) Fitting model: ExtraTreesGini_BAG_L2 ... Training model for up to 194.85s of the 355.66s of remaining time.
(_ray_fit pid=31048) Iteration 280: Low available memory (<512 MB). Available memory: 353.31 MB, Estimated model size: 18.46 MB.
(_ray_fit pid=31048) Early stopping model prior to optimal result to avoid OOM error. Please increase available memory to avoid subpar model quality.
(_ray_fit pid=31048) Ran low on memory, early stopping on iteration 280.
(_dystack pid=32908) 0.6898 = Validation score (roc_auc)
(_dystack pid=32908) 9.48s = Training runtime
(_dystack pid=32908) 6.47s = Validation runtime
(_dystack pid=32908) Fitting model: ExtraTreesEntr_BAG_L2 ... Training model for up to 178.05s of the 338.87s of remaining time.
(_dystack pid=32908) 0.6918 = Validation score (roc_auc)
(_dystack pid=32908) 8.9s = Training runtime
(_dystack pid=32908) 5.95s = Validation runtime
(_dystack pid=32908) Fitting model: NeuralNetFastAI_BAG_L2 ... Training model for up to 162.42s of the 323.23s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.07%)
(_ray_fit pid=952) No improvement since epoch 0: early stopping
(_dystack pid=32908) 0.6209 = Validation score (roc_auc)
(_dystack pid=32908) 69.11s = Training runtime
(_dystack pid=32908) 1.78s = Validation runtime
(_dystack pid=32908) Fitting model: XGBoost_BAG_L2 ... Training model for up to 82.89s of the 243.70s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=8.99%)
(_dystack pid=32908) 0.6895 = Validation score (roc_auc)
(_dystack pid=32908) 38.12s = Training runtime
(_dystack pid=32908) 0.59s = Validation runtime
(_dystack pid=32908) Fitting model: NeuralNetTorch_BAG_L2 ... Training model for up to 30.20s of the 191.01s of remaining time.
(_ray_fit pid=35136) No improvement since epoch 0: early stopping [repeated 7x across cluster]
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.29%)
(_ray_fit pid=17260) Ran out of time, stopping training early. (Stopping on epoch 11)
(_dystack pid=32908) 0.6951 = Validation score (roc_auc)
(_dystack pid=32908) 30.91s = Training runtime
(_dystack pid=32908) 1.08s = Validation runtime
(_dystack pid=32908) Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.00s of the 150.06s of remaining time.
(_dystack pid=32908) Ensemble Weights: {'CatBoost_BAG_L2': 0.5, 'RandomForestEntr_BAG_L2': 0.167, 'ExtraTreesEntr_BAG_L2': 0.167, 'NeuralNetTorch_BAG_L2': 0.125, 'ExtraTreesGini_BAG_L2': 0.042}
(_dystack pid=32908) 0.6994 = Validation score (roc_auc)
(_dystack pid=32908) 1.07s = Training runtime
(_dystack pid=32908) 0.0s = Validation runtime
(_dystack pid=32908) Fitting 108 L3 models, fit_strategy="sequential" ...
(_dystack pid=32908) Fitting model: LightGBMXT_BAG_L3 ... Training model for up to 148.93s of the 148.39s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=5.04%)
(_ray_fit pid=13300) Ran out of time, stopping training early. (Stopping on epoch 12) [repeated 6x across cluster]
(_dystack pid=32908) 0.6932 = Validation score (roc_auc)
(_dystack pid=32908) 4.76s = Training runtime
(_dystack pid=32908) 0.53s = Validation runtime
(_dystack pid=32908) Fitting model: LightGBM_BAG_L3 ... Training model for up to 135.02s of the 134.48s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=5.36%)
(_dystack pid=32908) 0.6911 = Validation score (roc_auc)
(_dystack pid=32908) 5.12s = Training runtime
(_dystack pid=32908) 0.51s = Validation runtime
(_dystack pid=32908) Fitting model: RandomForestGini_BAG_L3 ... Training model for up to 121.23s of the 120.69s of remaining time.
(_dystack pid=32908) 0.687 = Validation score (roc_auc)
(_dystack pid=32908) 6.62s = Training runtime
(_dystack pid=32908) 5.53s = Validation runtime
(_dystack pid=32908) Fitting model: RandomForestEntr_BAG_L3 ... Training model for up to 108.11s of the 107.57s of remaining time.
(_dystack pid=32908) 0.6861 = Validation score (roc_auc)
(_dystack pid=32908) 7.17s = Training runtime
(_dystack pid=32908) 5.43s = Validation runtime
(_dystack pid=32908) Fitting model: CatBoost_BAG_L3 ... Training model for up to 94.80s of the 94.26s of remaining time.
(_dystack pid=32908) Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 28.45% memory usage per fold, 56.91%/80.00% total).
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=28.45%)
(_dystack pid=32908) 0.6971 = Validation score (roc_auc)
(_dystack pid=32908) 45.12s = Training runtime
(_dystack pid=32908) 1.35s = Validation runtime
(_dystack pid=32908) Fitting model: ExtraTreesGini_BAG_L3 ... Training model for up to 44.97s of the 44.43s of remaining time.
(_dystack pid=32908) 0.6884 = Validation score (roc_auc)
(_dystack pid=32908) 7.48s = Training runtime
(_dystack pid=32908) 5.59s = Validation runtime
(_dystack pid=32908) Fitting model: ExtraTreesEntr_BAG_L3 ... Training model for up to 31.20s of the 30.66s of remaining time.
(_dystack pid=32908) 0.6883 = Validation score (roc_auc)
(_dystack pid=32908) 8.26s = Training runtime
(_dystack pid=32908) 5.63s = Validation runtime
(_dystack pid=32908) Fitting model: NeuralNetFastAI_BAG_L3 ... Training model for up to 16.51s of the 15.97s of remaining time.
(_dystack pid=32908) Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.51%)
(_dystack pid=32908) Time limit exceeded... Skipping NeuralNetFastAI_BAG_L3.
(_dystack pid=32908) Fitting model: WeightedEnsemble_L4 ... Training model for up to 360.00s of the -6.83s of remaining time.
(_dystack pid=32908) Ensemble Weights: {'LightGBMXT_BAG_L1': 0.28, 'CatBoost_BAG_L2': 0.16, 'CatBoost_BAG_L1': 0.12, 'RandomForestEntr_BAG_L2': 0.12, 'ExtraTreesEntr_BAG_L2': 0.12, 'ExtraTreesGini_BAG_L3': 0.08, 'KNeighborsDist_BAG_L1': 0.04, 'XGBoost_BAG_L1': 0.04, 'CatBoost_BAG_L3': 0.04}
(_dystack pid=32908) 0.7005 = Validation score (roc_auc)
(_dystack pid=32908) 1.19s = Training runtime
(_dystack pid=32908) 0.0s = Validation runtime
(_dystack pid=32908) AutoGluon training complete, total runtime = 899.64s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 259.6 rows/s (1464 batch size)
(_dystack pid=32908) TabularPredictor saved. To load, use: predictor = TabularPredictor.load("c:\studium\DAV Ausbildung\CADS\Immersion\ads3_exam\03_scripts\autogluon_diabetes_auc\ds_sub_fit\sub_fit_ho")
(_dystack pid=32908) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
(_dystack pid=32908) Deleting DyStack predictor artifacts (clean_up_fits=True) ...
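Remark: the "Ensemble Weights" lines in the log above list one weight per base model for each WeightedEnsemble. The sketch below illustrates how such weights could combine per-model probabilities into a single score, and how a ROC AUC can be computed on that blend. It is a minimal illustration, not AutoGluon's internal implementation: the helper names `blend` and `roc_auc` and the toy data are hypothetical, and only the weights are copied from the logged WeightedEnsemble_L2 line.

```python
# Illustrative sketch only; not AutoGluon's internal implementation.
# Weights copied from the "Ensemble Weights" line logged for WeightedEnsemble_L2.
WEIGHTS_L2 = {
    "ExtraTreesGini_BAG_L1": 0.25,
    "LightGBMXT_BAG_L1": 0.208,
    "LightGBM_BAG_L1": 0.167,
    "CatBoost_BAG_L1": 0.167,
    "RandomForestGini_BAG_L1": 0.083,
    "KNeighborsDist_BAG_L1": 0.042,
    "RandomForestEntr_BAG_L1": 0.042,
    "XGBoost_BAG_L1": 0.042,
}


def blend(probas, weights):
    """Weighted average of per-model probability lists (weights renormalised)."""
    total = sum(weights.values())  # logged weights are rounded, so renormalise
    n_rows = len(next(iter(probas.values())))
    return [
        sum(weights[m] * probas[m][i] for m in weights) / total
        for i in range(n_rows)
    ]


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 rows, every model given the same made-up probabilities.
y_toy = [0, 0, 1, 1]
toy_probas = {name: [0.1, 0.4, 0.35, 0.8] for name in WEIGHTS_L2}

blended = blend(toy_probas, WEIGHTS_L2)
auc_toy = roc_auc(y_toy, blended)
print(f"sum of logged weights: {sum(WEIGHTS_L2.values()):.3f}")
print(f"toy ROC AUC of the blend: {auc_toy:.2f}")
```

Because ROC AUC depends only on the ranking of the scores, it has to be evaluated on predicted probabilities rather than class labels, which is what the log message about `predict_proba()` points out. In the toy data three of the four positive/negative pairs are ordered correctly, so the blend has an AUC of 0.75.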
Leaderboard on holdout data (DyStack):
model score_holdout score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBM_BAG_L1 0.702035 0.692062 roc_auc 0.720246 0.524789 18.419194 0.720246 0.524789 18.419194 1 True 4
1 XGBoost_BAG_L1 0.700820 0.687669 roc_auc 1.123045 0.739026 46.888627 1.123045 0.739026 46.888627 1 True 11
2 LightGBM_BAG_L2 0.700251 0.692762 roc_auc 9.176113 33.953432 332.016193 0.492544 0.482172 5.213061 2 True 14
3 XGBoost_BAG_L2 0.699033 0.689484 roc_auc 9.603825 34.059568 364.922395 0.920256 0.588307 38.119262 2 True 21
4 LightGBMXT_BAG_L1 0.698827 0.693619 roc_auc 0.971671 0.333140 10.899239 0.971671 0.333140 10.899239 1 True 3
5 LightGBMXT_BAG_L2 0.698107 0.694226 roc_auc 9.206035 34.011583 333.101941 0.522466 0.540322 6.298809 2 True 13
6 CatBoost_BAG_L2 0.697666 0.698169 roc_auc 9.365991 35.083744 383.523052 0.682422 1.612483 56.719920 2 True 17
7 WeightedEnsemble_L4 0.697211 0.700509 roc_auc 16.088237 75.411415 619.356147 0.034240 0.000000 1.187552 4 True 31
8 WeightedEnsemble_L2 0.696547 0.701327 roc_auc 6.046668 23.579976 247.610606 0.025848 0.002013 0.887570 2 True 12
9 CatBoost_BAG_L3 0.695823 0.697128 roc_auc 15.596728 69.818422 610.692262 0.853104 1.345728 45.115420 3 True 28
10 LightGBMXT_BAG_L3 0.695640 0.693224 roc_auc 15.478703 69.005402 570.340166 0.735079 0.532707 4.763324 3 True 24
11 WeightedEnsemble_L3 0.695121 0.699397 roc_auc 11.411233 58.739377 441.333209 0.030765 0.003989 1.065155 3 True 23
12 NeuralNetTorch_BAG_L2 0.694657 0.695138 roc_auc 9.619292 34.548492 357.708633 0.935724 1.077231 30.905501 2 True 22
13 CatBoost_BAG_L1 0.694515 0.681403 roc_auc 1.753284 1.479760 138.480240 1.753284 1.479760 138.480240 1 True 7
14 LightGBM_BAG_L3 0.694351 0.691087 roc_auc 15.433444 68.987542 570.697489 0.689819 0.514848 5.120647 3 True 25
15 ExtraTreesEntr_BAG_L3 0.693597 0.688321 roc_auc 15.270735 74.106896 573.833061 0.527111 5.634202 8.256219 3 True 30
16 RandomForestEntr_BAG_L2 0.692863 0.693938 roc_auc 9.030982 43.630247 334.254858 0.347413 10.158986 7.451726 2 True 16
17 RandomForestEntr_BAG_L3 0.692686 0.686109 roc_auc 15.091510 73.904293 572.747903 0.347885 5.431598 7.171061 3 True 27
18 RandomForestGini_BAG_L3 0.691675 0.687028 roc_auc 15.188117 74.001036 572.201475 0.444493 5.528342 6.624633 3 True 26
19 ExtraTreesGini_BAG_L3 0.685558 0.688437 roc_auc 15.200894 74.065688 573.053175 0.457269 5.592993 7.476333 3 True 29
20 ExtraTreesGini_BAG_L2 0.684813 0.689820 roc_auc 9.036015 39.937354 336.287470 0.352446 6.466093 9.484338 2 True 18
21 RandomForestGini_BAG_L2 0.684480 0.687225 roc_auc 9.007171 39.819417 333.372889 0.323603 6.348156 6.569757 2 True 15
22 ExtraTreesEntr_BAG_L2 0.683113 0.691798 roc_auc 9.062463 39.420594 335.706569 0.378895 5.949334 8.903437 2 True 19
23 RandomForestGini_BAG_L1 0.676770 0.682589 roc_auc 0.447961 6.922239 10.741470 0.447961 6.922239 10.741470 1 True 5
24 RandomForestEntr_BAG_L1 0.674646 0.683656 roc_auc 0.433249 6.597428 9.971981 0.433249 6.597428 9.971981 1 True 6
25 ExtraTreesGini_BAG_L1 0.673604 0.683448 roc_auc 0.460975 6.374763 11.239830 0.460975 6.374763 11.239830 1 True 8
26 ExtraTreesEntr_BAG_L1 0.666480 0.677444 roc_auc 0.429228 7.400769 11.257731 0.429228 7.400769 11.257731 1 True 9
27 NeuralNetFastAI_BAG_L2 0.661369 0.620925 roc_auc 9.787857 35.249610 395.911031 1.104288 1.778350 69.107899 2 True 20
28 NeuralNetFastAI_BAG_L1 0.646361 0.614142 roc_auc 2.099842 1.943866 68.738361 2.099842 1.943866 68.738361 1 True 10
29 KNeighborsUnif_BAG_L1 0.578671 0.580084 roc_auc 0.133679 0.548662 0.084005 0.133679 0.548662 0.084005 1 True 1
30 KNeighborsDist_BAG_L1 0.571540 0.579972 roc_auc 0.110390 0.606818 0.082456 0.110390 0.606818 0.082456 1 True 2
2 = Optimal num_stack_levels (Stacked Overfitting Occurred: False)
934s = DyStack runtime | 2666s = Remaining runtime
Starting main fit with num_stack_levels=2.
For future fit calls on this dataset, you can skip DyStack to save time: `predictor.fit(..., dynamic_stacking=False, num_stack_levels=2)`
Beginning AutoGluon training ... Time limit = 2666s
AutoGluon will save models to "c:\studium\DAV Ausbildung\CADS\Immersion\ads3_exam\03_scripts\autogluon_diabetes_auc"
Train Data Rows: 13176
Train Data Columns: 2248
Label Column: TARGET
Problem Type: binary
Preprocessing data ...
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 2794.30 MB
Train Data (Original) Memory Usage: 225.98 MB (8.1% of available memory)
Warning: Data size prior to feature transformation consumes 8.1% of available memory. Consider increasing memory or subsampling the data to avoid instability.
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1712 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Useless Original Features (Count: 528): ['diag_1_110', 'diag_1_143', 'diag_1_147', 'diag_1_149', 'diag_1_160', 'diag_1_170', 'diag_1_173', 'diag_1_175', 'diag_1_187', 'diag_1_192', 'diag_1_194', 'diag_1_195', 'diag_1_200', 'diag_1_201', 'diag_1_207', 'diag_1_208', 'diag_1_210', 'diag_1_216', 'diag_1_219', 'diag_1_236', 'diag_1_246', 'diag_1_25021', 'diag_1_25032', 'diag_1_2505', 'diag_1_25053', 'diag_1_2509', 'diag_1_25091', 'diag_1_262', 'diag_1_266', 'diag_1_272', 'diag_1_301', 'diag_1_308', 'diag_1_318', 'diag_1_320', 'diag_1_322', 'diag_1_334', 'diag_1_335', 'diag_1_336', 'diag_1_337', 'diag_1_353', 'diag_1_36', 'diag_1_362', 'diag_1_363', 'diag_1_374', 'diag_1_377', 'diag_1_381', 'diag_1_382', 'diag_1_383', 'diag_1_385', 'diag_1_388', 'diag_1_39', 'diag_1_397', 'diag_1_412', 'diag_1_422', 'diag_1_445', 'diag_1_448', 'diag_1_457', 'diag_1_470', 'diag_1_477', 'diag_1_48', 'diag_1_49', 'diag_1_495', 'diag_1_501', 'diag_1_52', 'diag_1_521', 'diag_1_524', 'diag_1_526', 'diag_1_527', 'diag_1_528', 'diag_1_529', 'diag_1_542', 'diag_1_551', 'diag_1_565', 'diag_1_57', 'diag_1_573', 'diag_1_580', 'diag_1_582', 'diag_1_588', 'diag_1_603', 'diag_1_605', 'diag_1_61', 'diag_1_610', 'diag_1_632', 'diag_1_634', 'diag_1_643', 'diag_1_645', 'diag_1_646', 'diag_1_649', 'diag_1_653', 'diag_1_657', 'diag_1_665', 'diag_1_669', 'diag_1_671', 'diag_1_683', 'diag_1_684', 'diag_1_685', 'diag_1_690', 'diag_1_692', 'diag_1_693', 'diag_1_695', 'diag_1_7', 'diag_1_706', 'diag_1_709', 'diag_1_732', 'diag_1_734', 'diag_1_735', 'diag_1_745', 'diag_1_746', 'diag_1_747', 'diag_1_75', 'diag_1_791', 'diag_1_792', 'diag_1_795', 'diag_1_800', 'diag_1_810', 'diag_1_817', 'diag_1_831', 'diag_1_832', 'diag_1_835', 'diag_1_836', 'diag_1_84', 'diag_1_840', 'diag_1_845', 'diag_1_846', 'diag_1_848', 'diag_1_854', 'diag_1_862', 'diag_1_863', 'diag_1_868', 'diag_1_870', 'diag_1_871', 'diag_1_875', 'diag_1_879', 'diag_1_885', 'diag_1_886', 'diag_1_891', 'diag_1_892', 'diag_1_893', 'diag_1_897', 'diag_1_903', 
'diag_1_906', 'diag_1_911', 'diag_1_914', 'diag_1_916', 'diag_1_921', 'diag_1_934', 'diag_1_936', 'diag_1_941', 'diag_1_955', 'diag_1_957', 'diag_1_963', 'diag_1_968', 'diag_1_976', 'diag_1_977', 'diag_1_98', 'diag_1_982', 'diag_1_983', 'diag_1_987', 'diag_1_E909', 'diag_1_V45', 'diag_1_V51', 'diag_1_V56', 'diag_1_V70', 'diag_2_111', 'diag_2_123', 'diag_2_131', 'diag_2_137', 'diag_2_145', 'diag_2_152', 'diag_2_163', 'diag_2_164', 'diag_2_173', 'diag_2_180', 'diag_2_192', 'diag_2_212', 'diag_2_217', 'diag_2_225', 'diag_2_228', 'diag_2_235', 'diag_2_239', 'diag_2_240', 'diag_2_259', 'diag_2_268', 'diag_2_269', 'diag_2_27', 'diag_2_271', 'diag_2_281', 'diag_2_297', 'diag_2_306', 'diag_2_31', 'diag_2_310', 'diag_2_316', 'diag_2_324', 'diag_2_338', 'diag_2_34', 'diag_2_347', 'diag_2_35', 'diag_2_351', 'diag_2_352', 'diag_2_355', 'diag_2_369', 'diag_2_373', 'diag_2_374', 'diag_2_376', 'diag_2_379', 'diag_2_382', 'diag_2_383', 'diag_2_389', 'diag_2_394', 'diag_2_395', 'diag_2_422', 'diag_2_430', 'diag_2_442', 'diag_2_464', 'diag_2_470', 'diag_2_484', 'diag_2_514', 'diag_2_520', 'diag_2_521', 'diag_2_522', 'diag_2_523', 'diag_2_527', 'diag_2_529', 'diag_2_534', 'diag_2_550', 'diag_2_557', 'diag_2_570', 'diag_2_579', 'diag_2_580', 'diag_2_588', 'diag_2_598', 'diag_2_604', 'diag_2_605', 'diag_2_615', 'diag_2_619', 'diag_2_622', 'diag_2_623', 'diag_2_627', 'diag_2_634', 'diag_2_641', 'diag_2_644', 'diag_2_645', 'diag_2_652', 'diag_2_658', 'diag_2_659', 'diag_2_661', 'diag_2_663', 'diag_2_664', 'diag_2_665', 'diag_2_674', 'diag_2_683', 'diag_2_686', 'diag_2_702', 'diag_2_703', 'diag_2_704', 'diag_2_705', 'diag_2_706', 'diag_2_713', 'diag_2_725', 'diag_2_734', 'diag_2_741', 'diag_2_747', 'diag_2_748', 'diag_2_751', 'diag_2_753', 'diag_2_758', 'diag_2_759', 'diag_2_795', 'diag_2_797', 'diag_2_800', 'diag_2_801', 'diag_2_814', 'diag_2_815', 'diag_2_831', 'diag_2_832', 'diag_2_837', 'diag_2_843', 'diag_2_861', 'diag_2_863', 'diag_2_867', 'diag_2_868', 'diag_2_870', 'diag_2_872', 
'diag_2_880', 'diag_2_906', 'diag_2_910', 'diag_2_913', 'diag_2_915', 'diag_2_918', 'diag_2_922', 'diag_2_927', 'diag_2_933', 'diag_2_942', 'diag_2_944', 'diag_2_947', 'diag_2_952', 'diag_2_953', 'diag_2_962', 'diag_2_965', 'diag_2_967', 'diag_2_99', 'diag_2_991', 'diag_2_992', 'diag_2_E816', 'diag_2_E819', 'diag_2_E821', 'diag_2_E850', 'diag_2_E868', 'diag_2_E882', 'diag_2_E883', 'diag_2_E887', 'diag_2_E890', 'diag_2_E905', 'diag_2_E906', 'diag_2_E924', 'diag_2_E928', 'diag_2_E931', 'diag_2_E936', 'diag_2_E938', 'diag_2_E944', 'diag_2_E965', 'diag_2_E968', 'diag_2_E980', 'diag_2_V11', 'diag_2_V13', 'diag_2_V16', 'diag_2_V23', 'diag_2_V25', 'diag_2_V46', 'diag_2_V50', 'diag_2_V53', 'diag_2_V57', 'diag_2_V60', 'diag_2_V61', 'diag_2_V86', 'diag_3_115', 'diag_3_122', 'diag_3_123', 'diag_3_136', 'diag_3_139', 'diag_3_141', 'diag_3_150', 'diag_3_152', 'diag_3_161', 'diag_3_163', 'diag_3_17', 'diag_3_170', 'diag_3_171', 'diag_3_172', 'diag_3_182', 'diag_3_193', 'diag_3_214', 'diag_3_217', 'diag_3_223', 'diag_3_227', 'diag_3_233', 'diag_3_235', 'diag_3_240', 'diag_3_245', 'diag_3_25022', 'diag_3_2503', 'diag_3_25031', 'diag_3_258', 'diag_3_259', 'diag_3_265', 'diag_3_270', 'diag_3_271', 'diag_3_299', 'diag_3_314', 'diag_3_315', 'diag_3_318', 'diag_3_323', 'diag_3_35', 'diag_3_350', 'diag_3_359', 'diag_3_360', 'diag_3_361', 'diag_3_366', 'diag_3_370', 'diag_3_373', 'diag_3_374', 'diag_3_376', 'diag_3_379', 'diag_3_380', 'diag_3_381', 'diag_3_382', 'diag_3_383', 'diag_3_385', 'diag_3_387', 'diag_3_388', 'diag_3_395', 'diag_3_405', 'diag_3_417', 'diag_3_421', 'diag_3_430', 'diag_3_432', 'diag_3_436', 'diag_3_460', 'diag_3_463', 'diag_3_464', 'diag_3_47', 'diag_3_470', 'diag_3_472', 'diag_3_480', 'diag_3_49', 'diag_3_5', 'diag_3_500', 'diag_3_501', 'diag_3_506', 'diag_3_508', 'diag_3_514', 'diag_3_516', 'diag_3_523', 'diag_3_524', 'diag_3_525', 'diag_3_527', 'diag_3_534', 'diag_3_540', 'diag_3_542', 'diag_3_543', 'diag_3_556', 'diag_3_565', 'diag_3_566', 'diag_3_598', 
'diag_3_608', 'diag_3_624', 'diag_3_641', 'diag_3_647', 'diag_3_655', 'diag_3_657', 'diag_3_665', 'diag_3_669', 'diag_3_674', 'diag_3_685', 'diag_3_694', 'diag_3_702', 'diag_3_706', 'diag_3_712', 'diag_3_734', 'diag_3_735', 'diag_3_741', 'diag_3_744', 'diag_3_746', 'diag_3_747', 'diag_3_750', 'diag_3_754', 'diag_3_755', 'diag_3_758', 'diag_3_759', 'diag_3_78', 'diag_3_801', 'diag_3_810', 'diag_3_814', 'diag_3_822', 'diag_3_826', 'diag_3_831', 'diag_3_834', 'diag_3_844', 'diag_3_845', 'diag_3_847', 'diag_3_851', 'diag_3_853', 'diag_3_854', 'diag_3_864', 'diag_3_866', 'diag_3_868', 'diag_3_871', 'diag_3_872', 'diag_3_876', 'diag_3_877', 'diag_3_881', 'diag_3_883', 'diag_3_890', 'diag_3_906', 'diag_3_910', 'diag_3_911', 'diag_3_918', 'diag_3_919', 'diag_3_930', 'diag_3_933', 'diag_3_943', 'diag_3_945', 'diag_3_951', 'diag_3_952', 'diag_3_953', 'diag_3_955', 'diag_3_956', 'diag_3_962', 'diag_3_965', 'diag_3_966', 'diag_3_967', 'diag_3_969', 'diag_3_970', 'diag_3_972', 'diag_3_987', 'diag_3_989', 'diag_3_E815', 'diag_3_E817', 'diag_3_E818', 'diag_3_E819', 'diag_3_E822', 'diag_3_E825', 'diag_3_E826', 'diag_3_E828', 'diag_3_E853', 'diag_3_E858', 'diag_3_E861', 'diag_3_E882', 'diag_3_E883', 'diag_3_E905', 'diag_3_E906', 'diag_3_E916', 'diag_3_E917', 'diag_3_E924', 'diag_3_E941', 'diag_3_E943', 'diag_3_E949', 'diag_3_E966', 'diag_3_E980', 'diag_3_V02', 'diag_3_V03', 'diag_3_V13', 'diag_3_V16', 'diag_3_V23', 'diag_3_V57', 'diag_3_V61', 'diag_3_V66', 'diag_3_V86']
These features carry no predictive signal and should be manually investigated.
This is typically a feature which has the same value for all rows.
These features do not need to be present at inference time.
Unused Original Features (Count: 626): ['admission_type_id_7', 'discharge_disposition_id_24', 'admission_source_id_8', 'payer_code_MP', 'medical_specialty_Gynecology', 'medical_specialty_Obstetrics', 'medical_specialty_PediatricsCriticalCare', 'medical_specialty_Podiatry', 'diag_1_11', 'diag_1_117', 'diag_1_133', 'diag_1_136', 'diag_1_141', 'diag_1_142', 'diag_1_146', 'diag_1_152', 'diag_1_158', 'diag_1_161', 'diag_1_179', 'diag_1_180', 'diag_1_184', 'diag_1_196', 'diag_1_199', 'diag_1_204', 'diag_1_205', 'diag_1_212', 'diag_1_215', 'diag_1_223', 'diag_1_229', 'diag_1_233', 'diag_1_237', 'diag_1_239', 'diag_1_242', 'diag_1_245', 'diag_1_2503', 'diag_1_25033', 'diag_1_25043', 'diag_1_25051', 'diag_1_251', 'diag_1_253', 'diag_1_255', 'diag_1_263', 'diag_1_273', 'diag_1_275', 'diag_1_282', 'diag_1_283', 'diag_1_288', 'diag_1_289', 'diag_1_3', 'diag_1_303', 'diag_1_304', 'diag_1_305', 'diag_1_307', 'diag_1_312', 'diag_1_314', 'diag_1_325', 'diag_1_34', 'diag_1_340', 'diag_1_342', 'diag_1_344', 'diag_1_345', 'diag_1_35', 'diag_1_352', 'diag_1_359', 'diag_1_360', 'diag_1_368', 'diag_1_370', 'diag_1_378', 'diag_1_379', 'diag_1_380', 'diag_1_384', 'diag_1_391', 'diag_1_394', 'diag_1_395', 'diag_1_396', 'diag_1_405', 'diag_1_41', 'diag_1_416', 'diag_1_42', 'diag_1_430', 'diag_1_442', 'diag_1_451', 'diag_1_452', 'diag_1_454', 'diag_1_456', 'diag_1_459', 'diag_1_462', 'diag_1_463', 'diag_1_464', 'diag_1_474', 'diag_1_478', 'diag_1_483', 'diag_1_492', 'diag_1_5', 'diag_1_508', 'diag_1_512', 'diag_1_514', 'diag_1_515', 'diag_1_516', 'diag_1_522', 'diag_1_533', 'diag_1_556', 'diag_1_568', 'diag_1_570', 'diag_1_579', 'diag_1_581', 'diag_1_591', 'diag_1_594', 'diag_1_598', 'diag_1_601', 'diag_1_602', 'diag_1_607', 'diag_1_611', 'diag_1_614', 'diag_1_617', 'diag_1_619', 'diag_1_620', 'diag_1_623', 'diag_1_627', 'diag_1_637', 'diag_1_641', 'diag_1_642', 'diag_1_644', 'diag_1_655', 'diag_1_659', 'diag_1_661', 'diag_1_663', 'diag_1_674', 'diag_1_686', 'diag_1_696', 'diag_1_70', 
'diag_1_705', 'diag_1_708', 'diag_1_710', 'diag_1_714', 'diag_1_717', 'diag_1_720', 'diag_1_727', 'diag_1_736', 'diag_1_737', 'diag_1_753', 'diag_1_756', 'diag_1_759', 'diag_1_78', 'diag_1_788', 'diag_1_793', 'diag_1_796', 'diag_1_797', 'diag_1_801', 'diag_1_806', 'diag_1_816', 'diag_1_82', 'diag_1_826', 'diag_1_833', 'diag_1_843', 'diag_1_847', 'diag_1_853', 'diag_1_864', 'diag_1_865', 'diag_1_866', 'diag_1_867', 'diag_1_873', 'diag_1_880', 'diag_1_882', 'diag_1_890', 'diag_1_9', 'diag_1_913', 'diag_1_915', 'diag_1_922', 'diag_1_923', 'diag_1_924', 'diag_1_928', 'diag_1_933', 'diag_1_94', 'diag_1_952', 'diag_1_964', 'diag_1_965', 'diag_1_966', 'diag_1_970', 'diag_1_972', 'diag_1_986', 'diag_1_991', 'diag_1_992', 'diag_1_994', 'diag_1_999', 'diag_1_V26', 'diag_1_V55', 'diag_1_V60', 'diag_1_V71', 'diag_2_110', 'diag_2_114', 'diag_2_117', 'diag_2_135', 'diag_2_136', 'diag_2_140', 'diag_2_151', 'diag_2_155', 'diag_2_172', 'diag_2_179', 'diag_2_183', 'diag_2_188', 'diag_2_193', 'diag_2_200', 'diag_2_201', 'diag_2_208', 'diag_2_214', 'diag_2_215', 'diag_2_218', 'diag_2_220', 'diag_2_226', 'diag_2_227', 'diag_2_238', 'diag_2_246', 'diag_2_2501', 'diag_2_2502', 'diag_2_25022', 'diag_2_25033', 'diag_2_2505', 'diag_2_25052', 'diag_2_25053', 'diag_2_25083', 'diag_2_2509', 'diag_2_25091', 'diag_2_25093', 'diag_2_251', 'diag_2_252', 'diag_2_253', 'diag_2_258', 'diag_2_260', 'diag_2_261', 'diag_2_262', 'diag_2_273', 'diag_2_275', 'diag_2_279', 'diag_2_282', 'diag_2_283', 'diag_2_298', 'diag_2_301', 'diag_2_302', 'diag_2_307', 'diag_2_308', 'diag_2_317', 'diag_2_320', 'diag_2_322', 'diag_2_323', 'diag_2_327', 'diag_2_333', 'diag_2_344', 'diag_2_346', 'diag_2_353', 'diag_2_354', 'diag_2_360', 'diag_2_368', 'diag_2_372', 'diag_2_377', 'diag_2_380', 'diag_2_381', 'diag_2_386', 'diag_2_405', 'diag_2_421', 'diag_2_423', 'diag_2_436', 'diag_2_443', 'diag_2_447', 'diag_2_451', 'diag_2_452', 'diag_2_454', 'diag_2_455', 'diag_2_457', 'diag_2_459', 'diag_2_46', 'diag_2_460', 'diag_2_461', 
'diag_2_462', 'diag_2_463', 'diag_2_472', 'diag_2_474', 'diag_2_485', 'diag_2_487', 'diag_2_494', 'diag_2_495', 'diag_2_500', 'diag_2_501', 'diag_2_508', 'diag_2_517', 'diag_2_528', 'diag_2_53', 'diag_2_533', 'diag_2_54', 'diag_2_540', 'diag_2_543', 'diag_2_555', 'diag_2_566', 'diag_2_586', 'diag_2_594', 'diag_2_600', 'diag_2_601', 'diag_2_602', 'diag_2_603', 'diag_2_611', 'diag_2_614', 'diag_2_617', 'diag_2_620', 'diag_2_621', 'diag_2_626', 'diag_2_646', 'diag_2_647', 'diag_2_649', 'diag_2_654', 'diag_2_681', 'diag_2_691', 'diag_2_692', 'diag_2_693', 'diag_2_696', 'diag_2_698', 'diag_2_701', 'diag_2_709', 'diag_2_712', 'diag_2_717', 'diag_2_718', 'diag_2_736', 'diag_2_738', 'diag_2_742', 'diag_2_746', 'diag_2_752', 'diag_2_755', 'diag_2_756', 'diag_2_783', 'diag_2_79', 'diag_2_791', 'diag_2_792', 'diag_2_793', 'diag_2_796', 'diag_2_802', 'diag_2_820', 'diag_2_825', 'diag_2_833', 'diag_2_836', 'diag_2_840', 'diag_2_842', 'diag_2_844', 'diag_2_847', 'diag_2_852', 'diag_2_853', 'diag_2_864', 'diag_2_865', 'diag_2_866', 'diag_2_871', 'diag_2_873', 'diag_2_88', 'diag_2_881', 'diag_2_882', 'diag_2_883', 'diag_2_891', 'diag_2_892', 'diag_2_9', 'diag_2_907', 'diag_2_911', 'diag_2_916', 'diag_2_919', 'diag_2_920', 'diag_2_921', 'diag_2_945', 'diag_2_948', 'diag_2_955', 'diag_2_958', 'diag_2_969', 'diag_2_972', 'diag_2_980', 'diag_2_989', 'diag_2_999', 'diag_2_E814', 'diag_2_E826', 'diag_2_E858', 'diag_2_E879', 'diag_2_E880', 'diag_2_E881', 'diag_2_E916', 'diag_2_E918', 'diag_2_E919', 'diag_2_E927', 'diag_2_E932', 'diag_2_E933', 'diag_2_E935', 'diag_2_E937', 'diag_2_E947', 'diag_2_E950', 'diag_2_V02', 'diag_2_V03', 'diag_2_V09', 'diag_2_V10', 'diag_2_V14', 'diag_2_V17', 'diag_2_V44', 'diag_2_V63', 'diag_2_V65', 'diag_2_V70', 'diag_3_117', 'diag_3_131', 'diag_3_135', 'diag_3_14', 'diag_3_151', 'diag_3_153', 'diag_3_154', 'diag_3_156', 'diag_3_157', 'diag_3_158', 'diag_3_173', 'diag_3_179', 'diag_3_180', 'diag_3_183', 'diag_3_188', 'diag_3_189', 'diag_3_200', 'diag_3_201', 
'diag_3_202', 'diag_3_208', 'diag_3_220', 'diag_3_228', 'diag_3_239', 'diag_3_242', 'diag_3_2501', 'diag_3_25011', 'diag_3_25013', 'diag_3_2502', 'diag_3_25083', 'diag_3_2509', 'diag_3_25091', 'diag_3_25093', 'diag_3_251', 'diag_3_252', 'diag_3_256', 'diag_3_262', 'diag_3_266', 'diag_3_268', 'diag_3_27', 'diag_3_273', 'diag_3_277', 'diag_3_279', 'diag_3_281', 'diag_3_282', 'diag_3_283', 'diag_3_289', 'diag_3_293', 'diag_3_297', 'diag_3_298', 'diag_3_306', 'diag_3_312', 'diag_3_313', 'diag_3_334', 'diag_3_335', 'diag_3_336', 'diag_3_338', 'diag_3_341', 'diag_3_343', 'diag_3_347', 'diag_3_348', 'diag_3_351', 'diag_3_354', 'diag_3_355', 'diag_3_356', 'diag_3_358', 'diag_3_36544', 'diag_3_369', 'diag_3_378', 'diag_3_384', 'diag_3_386', 'diag_3_394', 'diag_3_398', 'diag_3_423', 'diag_3_431', 'diag_3_434', 'diag_3_446', 'diag_3_451', 'diag_3_454', 'diag_3_462', 'diag_3_477', 'diag_3_478', 'diag_3_494', 'diag_3_495', 'diag_3_510', 'diag_3_512', 'diag_3_517', 'diag_3_522', 'diag_3_529', 'diag_3_53', 'diag_3_532', 'diag_3_533', 'diag_3_537', 'diag_3_54', 'diag_3_550', 'diag_3_555', 'diag_3_57', 'diag_3_570', 'diag_3_572', 'diag_3_575', 'diag_3_579', 'diag_3_580', 'diag_3_582', 'diag_3_586', 'diag_3_594', 'diag_3_602', 'diag_3_604', 'diag_3_605', 'diag_3_607', 'diag_3_614', 'diag_3_619', 'diag_3_623', 'diag_3_627', 'diag_3_642', 'diag_3_644', 'diag_3_646', 'diag_3_653', 'diag_3_659', 'diag_3_661', 'diag_3_663', 'diag_3_664', 'diag_3_670', 'diag_3_686', 'diag_3_690', 'diag_3_692', 'diag_3_693', 'diag_3_695', 'diag_3_697', 'diag_3_698', 'diag_3_701', 'diag_3_703', 'diag_3_704', 'diag_3_705', 'diag_3_708', 'diag_3_709', 'diag_3_710', 'diag_3_713', 'diag_3_717', 'diag_3_718', 'diag_3_721', 'diag_3_723', 'diag_3_725', 'diag_3_726', 'diag_3_736', 'diag_3_737', 'diag_3_745', 'diag_3_751', 'diag_3_752', 'diag_3_79', 'diag_3_792', 'diag_3_793', 'diag_3_795', 'diag_3_800', 'diag_3_805', 'diag_3_808', 'diag_3_816', 'diag_3_820', 'diag_3_823', 'diag_3_825', 'diag_3_836', 'diag_3_838', 
'diag_3_842', 'diag_3_850', 'diag_3_852', 'diag_3_860', 'diag_3_862', 'diag_3_865', 'diag_3_873', 'diag_3_875', 'diag_3_880', 'diag_3_891', 'diag_3_9', 'diag_3_905', 'diag_3_907', 'diag_3_908', 'diag_3_912', 'diag_3_913', 'diag_3_917', 'diag_3_920', 'diag_3_921', 'diag_3_922', 'diag_3_924', 'diag_3_928', 'diag_3_934', 'diag_3_948', 'diag_3_958', 'diag_3_971', 'diag_3_991', 'diag_3_999', 'diag_3_E812', 'diag_3_E813', 'diag_3_E850', 'diag_3_E852', 'diag_3_E870', 'diag_3_E886', 'diag_3_E892', 'diag_3_E894', 'diag_3_E900', 'diag_3_E912', 'diag_3_E919', 'diag_3_E920', 'diag_3_E922', 'diag_3_E928', 'diag_3_E930', 'diag_3_E931', 'diag_3_E932', 'diag_3_E933', 'diag_3_E934', 'diag_3_E936', 'diag_3_E937', 'diag_3_E938', 'diag_3_E946', 'diag_3_E950', 'diag_3_E956', 'diag_3_V06', 'diag_3_V14', 'diag_3_V17', 'diag_3_V25', 'diag_3_V44', 'diag_3_V55', 'diag_3_V62', 'diag_3_V70', 'diag_3_V72', 'nateglinide_Up', 'chlorpropamide_Steady', 'tolbutamide_Steady', 'miglitol_Steady', 'group_diag_1_na', 'group_diag_2_na', 'group_diag_3_na']
These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
These features do not need to be present at inference time.
('float', []) : 626 | ['admission_type_id_7', 'discharge_disposition_id_24', 'admission_source_id_8', 'payer_code_MP', 'medical_specialty_Gynecology', ...]
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1094 | ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', ...]
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 8 | ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', ...]
('int', ['bool']) : 1086 | ['race_AfricanAmerican', 'race_Asian', 'race_Caucasian', 'race_Hispanic', 'race_Other', ...]
16.2s = Fit runtime
1094 features in original data used to generate 1094 features in processed data.
Train Data (Processed) Memory Usage: 14.45 MB (0.5% of available memory)
Data preprocessing and feature engineering runtime = 16.94s ...
AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'
This metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()
To change this, specify the eval_metric parameter of Predictor()
Large model count detected (112 configs) ... Only displaying the first 3 models of each family. To see all, set `verbosity=3`.
User-specified model hyperparameters to be fit:
{
'NN_TORCH': [{}, {'activation': 'elu', 'dropout_prob': 0.10077639529843717, 'hidden_size': 108, 'learning_rate': 0.002735937344002146, 'num_layers': 4, 'use_batchnorm': True, 'weight_decay': 1.356433327634438e-12, 'ag_args': {'name_suffix': '_r79', 'priority': -2}}, {'activation': 'elu', 'dropout_prob': 0.11897478034205347, 'hidden_size': 213, 'learning_rate': 0.0010474382260641949, 'num_layers': 4, 'use_batchnorm': False, 'weight_decay': 5.594471067786272e-10, 'ag_args': {'name_suffix': '_r22', 'priority': -7}}],
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
'CAT': [{}, {'depth': 6, 'grow_policy': 'SymmetricTree', 'l2_leaf_reg': 2.1542798306067823, 'learning_rate': 0.06864209415792857, 'max_ctr_complexity': 4, 'one_hot_max_size': 10, 'ag_args': {'name_suffix': '_r177', 'priority': -1}}, {'depth': 8, 'grow_policy': 'Depthwise', 'l2_leaf_reg': 2.7997999596449104, 'learning_rate': 0.031375015734637225, 'max_ctr_complexity': 2, 'one_hot_max_size': 3, 'ag_args': {'name_suffix': '_r9', 'priority': -5}}],
'XGB': [{}, {'colsample_bytree': 0.6917311125174739, 'enable_categorical': False, 'learning_rate': 0.018063876087523967, 'max_depth': 10, 'min_child_weight': 0.6028633586934382, 'ag_args': {'name_suffix': '_r33', 'priority': -8}}, {'colsample_bytree': 0.6628423832084077, 'enable_categorical': False, 'learning_rate': 0.08775715546881824, 'max_depth': 5, 'min_child_weight': 0.6294123374222513, 'ag_args': {'name_suffix': '_r89', 'priority': -16}}],
'FASTAI': [{}, {'bs': 256, 'emb_drop': 0.5411770367537934, 'epochs': 43, 'layers': [800, 400], 'lr': 0.01519848858318159, 'ps': 0.23782946566604385, 'ag_args': {'name_suffix': '_r191', 'priority': -4}}, {'bs': 2048, 'emb_drop': 0.05070411322605811, 'epochs': 29, 'layers': [200, 100], 'lr': 0.08974235041576624, 'ps': 0.10393466140748028, 'ag_args': {'name_suffix': '_r102', 'priority': -11}}],
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
AutoGluon will fit 3 stack levels (L1 to L3) ...
Fitting 110 L1 models, fit_strategy="sequential" ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 1177.27s of the 2649.49s of remaining time.
0.576 = Validation score (roc_auc)
0.1s = Training runtime
0.73s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 1172.88s of the 2645.11s of remaining time.
0.5757 = Validation score (roc_auc)
0.1s = Training runtime
0.66s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 1171.89s of the 2644.11s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 12.73% memory usage per fold, 50.91%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=12.73%)
0.6945 = Validation score (roc_auc)
13.44s = Training runtime
0.46s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 1152.53s of the 2624.75s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 16.03% memory usage per fold, 64.12%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=16.03%)
0.6954 = Validation score (roc_auc)
13.41s = Training runtime
0.43s = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ... Training model for up to 1133.70s of the 2605.93s of remaining time.
0.6826 = Validation score (roc_auc)
10.53s = Training runtime
9.0s = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ... Training model for up to 1113.15s of the 2585.38s of remaining time.
0.6845 = Validation score (roc_auc)
12.15s = Training runtime
8.36s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 1091.55s of the 2563.77s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 1 folds in parallel instead (Estimated 60.11% memory usage per fold, 60.11%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (1 workers, per: cpus=6, gpus=0, memory=60.11%)
Switching to pseudo sequential ParallelFoldFittingStrategy to avoid Python memory leakage.
Overrule this behavior by setting fold_fitting_strategy to 'sequential_local' in ag_args_ensemble when when calling `predictor.fit`
0.6983 = Validation score (roc_auc)
132.55s = Training runtime
1.37s = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ... Training model for up to 953.60s of the 2425.83s of remaining time.
0.6758 = Validation score (roc_auc)
11.45s = Training runtime
7.15s = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ... Training model for up to 934.13s of the 2406.35s of remaining time.
0.6806 = Validation score (roc_auc)
12.1s = Training runtime
7.62s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 913.46s of the 2385.68s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=6.16%)
0.6231 = Validation score (roc_auc)
81.84s = Training runtime
1.73s = Validation runtime
Fitting model: XGBoost_BAG_L1 ... Training model for up to 823.62s of the 2295.85s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=7.15%)
0.6923 = Validation score (roc_auc)
47.71s = Training runtime
0.61s = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ... Training model for up to 767.82s of the 2240.04s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.12%)
0.6888 = Validation score (roc_auc)
56.36s = Training runtime
0.85s = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ... Training model for up to 703.63s of the 2175.85s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 13.32% memory usage per fold, 53.28%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=13.32%)
0.6866 = Validation score (roc_auc)
29.96s = Training runtime
0.67s = Validation runtime
Fitting model: CatBoost_r177_BAG_L1 ... Training model for up to 667.72s of the 2139.95s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 19.65% memory usage per fold, 78.61%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=19.65%)
0.6984 = Validation score (roc_auc)
44.74s = Training runtime
1.76s = Validation runtime
Fitting model: NeuralNetTorch_r79_BAG_L1 ... Training model for up to 616.62s of the 2088.85s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.20%)
0.6536 = Validation score (roc_auc)
62.85s = Training runtime
0.89s = Validation runtime
Fitting model: LightGBM_r131_BAG_L1 ... Training model for up to 545.57s of the 2017.79s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=6.53%)
0.6946 = Validation score (roc_auc)
10.85s = Training runtime
1.29s = Validation runtime
Fitting model: NeuralNetFastAI_r191_BAG_L1 ... Training model for up to 526.30s of the 1998.53s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.32%)
0.6101 = Validation score (roc_auc)
140.0s = Training runtime
2.18s = Validation runtime
Fitting model: CatBoost_r9_BAG_L1 ... Training model for up to 379.89s of the 1852.11s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 34.23% memory usage per fold, 68.47%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=34.23%)
0.6971 = Validation score (roc_auc)
184.69s = Training runtime
1.67s = Validation runtime
Fitting model: LightGBM_r96_BAG_L1 ... Training model for up to 190.17s of the 1662.40s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.88%)
0.6976 = Validation score (roc_auc)
9.5s = Training runtime
1.49s = Validation runtime
Fitting model: NeuralNetTorch_r22_BAG_L1 ... Training model for up to 174.02s of the 1646.24s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.28%)
0.6807 = Validation score (roc_auc)
71.46s = Training runtime
0.78s = Validation runtime
Fitting model: XGBoost_r33_BAG_L1 ... Training model for up to 96.13s of the 1568.35s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 1 folds in parallel instead (Estimated 49.97% memory usage per fold, 49.97%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (1 workers, per: cpus=6, gpus=0, memory=49.97%)
Switching to pseudo sequential ParallelFoldFittingStrategy to avoid Python memory leakage.
Overrule this behavior by setting fold_fitting_strategy to 'sequential_local' in ag_args_ensemble when when calling `predictor.fit`
0.6887 = Validation score (roc_auc)
103.25s = Training runtime
0.8s = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.00s of the 1460.75s of remaining time.
Ensemble Weights: {'CatBoost_r177_BAG_L1': 0.28, 'LightGBM_r96_BAG_L1': 0.24, 'NeuralNetTorch_BAG_L1': 0.12, 'LightGBM_BAG_L1': 0.08, 'RandomForestEntr_BAG_L1': 0.08, 'CatBoost_BAG_L1': 0.04, 'ExtraTreesEntr_BAG_L1': 0.04, 'NeuralNetTorch_r79_BAG_L1': 0.04, 'CatBoost_r9_BAG_L1': 0.04, 'NeuralNetTorch_r22_BAG_L1': 0.04}
0.7056 = Validation score (roc_auc)
1.47s = Training runtime
0.0s = Validation runtime
Fitting 108 L2 models, fit_strategy="sequential" ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 972.56s of the 1459.01s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=4.34%)
0.7006 = Validation score (roc_auc)
7.33s = Training runtime
0.83s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 956.40s of the 1442.84s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=4.48%)
0.6984 = Validation score (roc_auc)
8.17s = Training runtime
0.62s = Validation runtime
Fitting model: RandomForestGini_BAG_L2 ... Training model for up to 938.13s of the 1424.57s of remaining time.
0.6964 = Validation score (roc_auc)
8.81s = Training runtime
7.19s = Validation runtime
Fitting model: RandomForestEntr_BAG_L2 ... Training model for up to 921.08s of the 1407.52s of remaining time.
0.6968 = Validation score (roc_auc)
8.43s = Training runtime
6.65s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 905.16s of the 1391.60s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 23.25% memory usage per fold, 46.49%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=23.25%)
0.7037 = Validation score (roc_auc)
53.53s = Training runtime
1.89s = Validation runtime
Fitting model: ExtraTreesGini_BAG_L2 ... Training model for up to 846.98s of the 1333.42s of remaining time.
0.696 = Validation score (roc_auc)
12.57s = Training runtime
7.44s = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L2 ... Training model for up to 825.88s of the 1312.32s of remaining time.
0.6936 = Validation score (roc_auc)
11.61s = Training runtime
7.31s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L2 ... Training model for up to 805.91s of the 1292.35s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=3.15%)
0.6164 = Validation score (roc_auc)
71.66s = Training runtime
1.65s = Validation runtime
Fitting model: XGBoost_BAG_L2 ... Training model for up to 725.05s of the 1211.50s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=8.40%)
0.6956 = Validation score (roc_auc)
34.53s = Training runtime
0.67s = Validation runtime
Fitting model: NeuralNetTorch_BAG_L2 ... Training model for up to 682.63s of the 1169.08s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.36%)
0.6779 = Validation score (roc_auc)
64.64s = Training runtime
0.71s = Validation runtime
Fitting model: LightGBMLarge_BAG_L2 ... Training model for up to 610.36s of the 1096.80s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 14.87% memory usage per fold, 59.50%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=14.87%)
0.6941 = Validation score (roc_auc)
29.61s = Training runtime
0.51s = Validation runtime
Fitting model: CatBoost_r177_BAG_L2 ... Training model for up to 576.50s of the 1062.94s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 21.55% memory usage per fold, 43.10%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=21.55%)
0.7034 = Validation score (roc_auc)
54.54s = Training runtime
1.67s = Validation runtime
Fitting model: NeuralNetTorch_r79_BAG_L2 ... Training model for up to 517.10s of the 1003.54s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.36%)
0.675 = Validation score (roc_auc)
50.12s = Training runtime
1.12s = Validation runtime
Fitting model: LightGBM_r131_BAG_L2 ... Training model for up to 457.13s of the 943.57s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=7.32%)
0.6975 = Validation score (roc_auc)
10.98s = Training runtime
0.82s = Validation runtime
Fitting model: NeuralNetFastAI_r191_BAG_L2 ... Training model for up to 438.90s of the 925.34s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.64%)
0.6253 = Validation score (roc_auc)
153.24s = Training runtime
2.7s = Validation runtime
Fitting model: CatBoost_r9_BAG_L2 ... Training model for up to 278.29s of the 764.73s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 38.14% memory usage per fold, 76.28%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=38.14%)
0.7003 = Validation score (roc_auc)
155.25s = Training runtime
1.9s = Validation runtime
Fitting model: LightGBM_r96_BAG_L2 ... Training model for up to 115.90s of the 602.34s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.65%)
0.7019 = Validation score (roc_auc)
6.53s = Training runtime
0.85s = Validation runtime
Fitting model: NeuralNetTorch_r22_BAG_L2 ... Training model for up to 101.32s of the 587.76s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.22%)
0.7141 = Validation score (roc_auc)
88.44s = Training runtime
1.32s = Validation runtime
Fitting model: XGBoost_r33_BAG_L2 ... Training model for up to 5.61s of the 492.05s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 1 folds in parallel instead (Estimated 56.21% memory usage per fold, 56.21%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (1 workers, per: cpus=6, gpus=0, memory=56.21%)
Switching to pseudo sequential ParallelFoldFittingStrategy to avoid Python memory leakage.
Overrule this behavior by setting fold_fitting_strategy to 'sequential_local' in ag_args_ensemble when when calling `predictor.fit`
0.6385 = Validation score (roc_auc)
36.6s = Training runtime
0.67s = Validation runtime
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.00s of the 450.22s of remaining time.
Ensemble Weights: {'CatBoost_BAG_L2': 0.5, 'NeuralNetTorch_r22_BAG_L2': 0.5}
0.7263 = Validation score (roc_auc)
1.44s = Training runtime
0.0s = Validation runtime
Fitting 108 L3 models, fit_strategy="sequential" ...
Fitting model: LightGBMXT_BAG_L3 ... Training model for up to 448.74s of the 448.53s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=4.14%)
0.7204 = Validation score (roc_auc)
8.83s = Training runtime
1.09s = Validation runtime
Fitting model: LightGBM_BAG_L3 ... Training model for up to 427.50s of the 427.29s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=4.21%)
0.722 = Validation score (roc_auc)
8.75s = Training runtime
0.64s = Validation runtime
Fitting model: RandomForestGini_BAG_L3 ... Training model for up to 408.96s of the 408.75s of remaining time.
0.7102 = Validation score (roc_auc)
8.57s = Training runtime
6.13s = Validation runtime
Fitting model: RandomForestEntr_BAG_L3 ... Training model for up to 393.11s of the 392.90s of remaining time.
0.7115 = Validation score (roc_auc)
8.03s = Training runtime
5.8s = Validation runtime
Fitting model: CatBoost_BAG_L3 ... Training model for up to 378.52s of the 378.30s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 21.74% memory usage per fold, 43.48%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=21.74%)
0.7254 = Validation score (roc_auc)
43.53s = Training runtime
1.22s = Validation runtime
Fitting model: ExtraTreesGini_BAG_L3 ... Training model for up to 331.74s of the 331.52s of remaining time.
0.7063 = Validation score (roc_auc)
9.12s = Training runtime
5.58s = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L3 ... Training model for up to 316.34s of the 316.12s of remaining time.
0.708 = Validation score (roc_auc)
9.42s = Training runtime
5.85s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L3 ... Training model for up to 300.39s of the 300.17s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=2.53%)
0.6154 = Validation score (roc_auc)
70.78s = Training runtime
1.62s = Validation runtime
Fitting model: XGBoost_BAG_L3 ... Training model for up to 221.52s of the 221.31s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=9.09%)
0.7217 = Validation score (roc_auc)
36.58s = Training runtime
0.87s = Validation runtime
Fitting model: NeuralNetTorch_BAG_L3 ... Training model for up to 177.98s of the 177.76s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.47%)
0.7154 = Validation score (roc_auc)
58.22s = Training runtime
0.86s = Validation runtime
Fitting model: LightGBMLarge_BAG_L3 ... Training model for up to 110.94s of the 110.72s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 4 folds in parallel instead (Estimated 15.97% memory usage per fold, 63.90%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (4 workers, per: cpus=3, gpus=0, memory=15.97%)
0.7169 = Validation score (roc_auc)
33.2s = Training runtime
0.54s = Validation runtime
Fitting model: CatBoost_r177_BAG_L3 ... Training model for up to 73.04s of the 72.83s of remaining time.
Memory not enough to fit 8 folds in parallel. Will train 2 folds in parallel instead (Estimated 22.37% memory usage per fold, 44.74%/80.00% total).
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (2 workers, per: cpus=6, gpus=0, memory=22.37%)
0.7256 = Validation score (roc_auc)
54.74s = Training runtime
1.78s = Validation runtime
Fitting model: NeuralNetTorch_r79_BAG_L3 ... Training model for up to 12.85s of the 12.64s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy (8 workers, per: cpus=1, gpus=0, memory=1.62%)
0.682 = Validation score (roc_auc)
16.25s = Training runtime
0.98s = Validation runtime
Fitting model: WeightedEnsemble_L4 ... Training model for up to 360.00s of the -12.66s of remaining time.
Ensemble Weights: {'NeuralNetTorch_r22_BAG_L2': 0.24, 'CatBoost_BAG_L2': 0.2, 'CatBoost_r177_BAG_L3': 0.16, 'LightGBM_BAG_L3': 0.12, 'XGBoost_BAG_L3': 0.12, 'XGBoost_BAG_L1': 0.08, 'NeuralNetTorch_BAG_L1': 0.04, 'CatBoost_BAG_L3': 0.04}
0.7276 = Validation score (roc_auc)
2.06s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 2681.28s ... Best model: WeightedEnsemble_L4 | Estimated inference throughput: 35.8 rows/s (1647 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("c:\studium\DAV Ausbildung\CADS\Immersion\ads3_exam\03_scripts\autogluon_diabetes_auc")
Elapsed time (sec): 3615.57
# View leaderboard
leaderboard = AG.leaderboard(Xy_val, silent=True)
leaderboard.head(15)
| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM_BAG_L2 | 0.706243 | 0.698350 | roc_auc | 36.435412 | 50.441354 | 1027.245120 | 1.341141 | 0.623350 | 8.168139 | 2 | True | 24 |
| 1 | LightGBM_r96_BAG_L2 | 0.705856 | 0.701887 | roc_auc | 37.057245 | 50.668730 | 1025.605496 | 1.962975 | 0.850726 | 6.528515 | 2 | True | 39 |
| 2 | CatBoost_BAG_L2 | 0.705774 | 0.703741 | roc_auc | 36.082140 | 51.708746 | 1072.608280 | 0.987869 | 1.890742 | 53.531299 | 2 | True | 27 |
| 3 | CatBoost_r177_BAG_L2 | 0.705515 | 0.703445 | roc_auc | 36.313934 | 51.491967 | 1073.612370 | 1.219664 | 1.673963 | 54.535389 | 2 | True | 34 |
| 4 | RandomForestEntr_BAG_L1 | 0.705087 | 0.684488 | roc_auc | 0.566349 | 8.355005 | 12.145532 | 0.566349 | 8.355005 | 12.145532 | 1 | True | 6 |
| 5 | LightGBM_r131_BAG_L2 | 0.704889 | 0.697478 | roc_auc | 36.857112 | 50.634384 | 1030.057138 | 1.762841 | 0.816380 | 10.980157 | 2 | True | 36 |
| 6 | WeightedEnsemble_L2 | 0.704813 | 0.705625 | roc_auc | 13.358419 | 25.202521 | 601.281765 | 0.042000 | 0.000000 | 1.471262 | 2 | True | 22 |
| 7 | LightGBMXT_BAG_L2 | 0.704788 | 0.700562 | roc_auc | 36.396357 | 50.646163 | 1026.403245 | 1.302086 | 0.828159 | 7.326264 | 2 | True | 23 |
| 8 | CatBoost_r9_BAG_L2 | 0.704519 | 0.700336 | roc_auc | 36.573185 | 51.716511 | 1174.326463 | 1.478915 | 1.898508 | 155.249482 | 2 | True | 38 |
| 9 | XGBoost_BAG_L2 | 0.704185 | 0.695643 | roc_auc | 37.234827 | 50.486380 | 1053.603121 | 2.140556 | 0.668376 | 34.526139 | 2 | True | 31 |
| 10 | LightGBMLarge_BAG_L1 | 0.704128 | 0.686622 | roc_auc | 1.641981 | 0.673794 | 29.963329 | 1.641981 | 0.673794 | 29.963329 | 1 | True | 13 |
| 11 | RandomForestGini_BAG_L1 | 0.704010 | 0.682610 | roc_auc | 0.662482 | 9.003107 | 10.533237 | 0.662482 | 9.003107 | 10.533237 | 1 | True | 5 |
| 12 | ExtraTreesEntr_BAG_L2 | 0.703902 | 0.693596 | roc_auc | 35.602563 | 57.126904 | 1030.688657 | 0.508293 | 7.308901 | 11.611676 | 2 | True | 29 |
| 13 | LightGBMLarge_BAG_L2 | 0.703586 | 0.694109 | roc_auc | 36.754710 | 50.329182 | 1048.688829 | 1.660439 | 0.511178 | 29.611848 | 2 | True | 33 |
| 14 | LightGBM_r131_BAG_L1 | 0.703490 | 0.694592 | roc_auc | 2.158012 | 1.291697 | 10.845987 | 2.158012 | 1.291697 | 10.845987 | 1 | True | 16 |
# Evaluate on validation data
performance = AG.evaluate(Xy_val)
# Extract and print the validation AUC score
auc_AG = performance['roc_auc']
print(f'The validation AUC of the best model from AutoGluon is: {auc_AG:.6f}')
The validation AUC of the best model from AutoGluon is: 0.700962
The validation AUC of 0.700962, obtained with AutoGluon's automatically selected best model (WeightedEnsemble_L4), unfortunately does not improve on the AUC of our initial baseline model. We nevertheless proceed with a weighted ensemble from AutoGluon and choose the one ranked highest on the leaderboard by its score on our validation data (the score_test column): WeightedEnsemble_L2, the seventh entry (row index 6).
# Choose the weighted ensemble at row index 6 of the leaderboard (WeightedEnsemble_L2)
WE = leaderboard.iloc[6]['model']
performance = AG.evaluate(Xy_val, model=WE)
# Extract and print the validation AUC score
auc_WE = performance['roc_auc']
print(f'The validation AUC of the best model from AutoGluon is: {auc_WE:.6f}')
The validation AUC of the best model from AutoGluon is: 0.704813
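Selecting the ensemble by row position (`iloc[6]`) depends on the leaderboard ordering of one particular run. As a hedged alternative, the sketch below picks the best-scoring entry whose name starts with `WeightedEnsemble`. It is only an illustration: it works on a plain list of dictionaries holding a few rows copied from the leaderboard above, not on the real `leaderboard` DataFrame, and the helper name `best_model_with_prefix` is hypothetical.

```python
# A few (model, score_test) rows copied from the leaderboard output above.
leaderboard_rows = [
    {"model": "LightGBM_BAG_L2", "score_test": 0.706243},
    {"model": "CatBoost_BAG_L2", "score_test": 0.705774},
    {"model": "RandomForestEntr_BAG_L1", "score_test": 0.705087},
    {"model": "WeightedEnsemble_L2", "score_test": 0.704813},
    {"model": "XGBoost_BAG_L2", "score_test": 0.704185},
]


def best_model_with_prefix(rows, prefix):
    """Return the name of the best-scoring model whose name starts with `prefix`."""
    candidates = [row for row in rows if row["model"].startswith(prefix)]
    if not candidates:
        raise ValueError(f"no model name starts with {prefix!r}")
    return max(candidates, key=lambda row: row["score_test"])["model"]


selected = best_model_with_prefix(leaderboard_rows, "WeightedEnsemble")
print(selected)
```

With the actual DataFrame, the same idea could be expressed by filtering on `leaderboard['model'].str.startswith('WeightedEnsemble')` before taking the first row.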
11.1 Model evaluation: Validation and test set
Now that the modeling is complete, it is time to evaluate our models on test data and compare the results with the validation data and between the different models of interest. We consider all models created or optimized in the previous sections.
# Feature scaling using StandardScaler based on training data distributions (6.1)
X_test = pd.DataFrame(scaler.transform(X_test), columns = feature_names)
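The scaler is fitted on the training data only and is merely applied to the test data, which keeps test-set information out of the preprocessing. The following dependency-free sketch illustrates this on made-up toy values for a single feature. The helper names `fit_standardizer` and `apply_standardizer` are hypothetical; they mimic the way `StandardScaler` standardizes with the mean and the population standard deviation.

```python
import statistics


def fit_standardizer(values):
    """Estimate mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_standardizer(values, mean, std):
    """Standardize values with statistics that were estimated elsewhere."""
    return [(v - mean) / std for v in values]


# Toy data (illustrative only, not taken from the diabetes dataset).
train_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
test_toy = [3.0, 6.0]

# Fit on the training values, then reuse the same statistics for the test values.
mean_train, std_train = fit_standardizer(train_toy)
test_scaled = apply_standardizer(test_toy, mean_train, std_train)
print(test_scaled)
```

A test value equal to the training mean maps to zero, while values outside the training range simply receive larger standardized magnitudes.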
# Make predictions for test data
y_test_CB1 = CB1.predict_proba(X_raw_test)
y_test_LR1 = LR1.predict(X_pre_test)
y_test_CB4 = CB4.predict_proba(X_test)
y_test_CB5 = CB5.predict_proba(X_test)
y_test_CB6 = CB6.predict_proba(X_test)
y_test_LGB = LGB.predict_proba(X_test)
y_test_XGB = XGB.predict_proba(X_test)
y_test_ANN = ANN.predict(X2_test_ANN)
y_test_LR2 = LR2.predict(X2_test)
y_test_CB7 = CB7.predict_proba(X2_test)
y_test_CB8 = CB8.predict_proba(X2_all_test)
y_test_WE = AG.predict_proba(X_test, model=WE)
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
# Calculate model AUC based on test data
auc_CB1_test = roc_auc_score(y_test, y_test_CB1[:, 1])
auc_LR1_test = roc_auc_score(y_test, y_test_LR1)
# Calculate the further model test AUC
auc_CB4_test = roc_auc_score(y_test, y_test_CB4[:, 1])
auc_CB5_test = roc_auc_score(y_test, y_test_CB5[:, 1])
auc_CB6_test = roc_auc_score(y_test, y_test_CB6[:, 1])
auc_LGB_test = roc_auc_score(y_test, y_test_LGB[:, 1])
auc_XGB_test = roc_auc_score(y_test, y_test_XGB[:, 1])
auc_ANN_test = roc_auc_score(y2_test, y_test_ANN[:, 0])
auc_LR2_test = roc_auc_score(y2_test, y_test_LR2)
auc_CB7_test = roc_auc_score(y2_test, y_test_CB7[:, 1])
auc_CB8_test = roc_auc_score(y2_test, y_test_CB8[:, 1])
auc_WE_test = roc_auc_score(y_test, y_test_WE[1])
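The AUC values computed here can be read as the probability that a randomly chosen readmitted case receives a higher score than a randomly chosen non-readmitted case, with ties counting one half. The dependency-free sketch below illustrates this interpretation on made-up toy labels and scores. The function name `pairwise_auc` is hypothetical, and its quadratic run time makes it suitable for illustration only; in the notebook itself `roc_auc_score` remains the tool of choice.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only, not taken from the diabetes dataset).
y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, score_toy)
print(auc_toy)
```

In the toy data, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.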
# Data structures for model names, val/test-sample and AUC
mdict = {'Model name': [], 'Set': [], 'AUC': []}
# List of models and their AUC scores on validation and test sets
models_auc = [
("CB1_quick", auc_CB1, auc_CB1_test),
("LR1_select", auc_LR1, auc_LR1_test),
("CB4_OHE", auc_CB4, auc_CB4_test),
("CB5_subsample", auc_CB5, auc_CB5_test),
("CB6_tuned", auc_CB6, auc_CB6_test),
("LGB_tuned", auc_LGB, auc_LGB_test),
("XGB_tuned", auc_XGB, auc_XGB_test),
("ANN_tuned", auc_ANN, auc_ANN_test),
("LR2_with_embeddings", auc_LR2, auc_LR2_test),
("CB7_with_embeddings", auc_CB7, auc_CB7_test),
("CB8_quick_with_embeddings", auc_CB8, auc_CB8_test),
("WeightedEnsemble", auc_WE, auc_WE_test),
]
# Populate the dictionary by looping over the models and their AUC scores
for model_name, auc_val, auc_test in models_auc:
    mdict['Model name'].extend([model_name] * 2)  # Model name twice for Val. and Test
    mdict['Set'].extend(["Val.", "Test"])
    mdict['AUC'].extend([auc_val, auc_test])
df_eval = pd.DataFrame(mdict)
plt.title("Model evaluation (AUC):")
sns.barplot(data=df_eval, x="AUC", y="Model name",hue="Set")
plt.xlim(0.50, 0.76)
plt.show()
We observe that CB1_quick, CB5_subsample, and WeightedEnsemble achieve the highest AUC scores on the test set, all slightly exceeding 0.70. Notably, models incorporating embeddings (such as CB8_quick_with_embeddings, LR2_with_embeddings) generally match or outperform their counterparts without embeddings in terms of test AUC. The logistic regression variants (LR1_select and LR2_with_embeddings) deliver consistent but relatively modest performance, with test AUC scores in the range of 0.66 to 0.68.
Interestingly, WeightedEnsemble achieves not only the highest test AUC but also maintains comparable validation performance, indicating good generalization. In contrast, some models (e.g., LR1_select) show a larger gap between validation and test performance, suggesting limited generalizability.
In conclusion, boosted tree models, even with minimal preprocessing, deliver the strongest results overall. Only some models exhibit a slight decline in AUC from validation to test set, which is typical and expected. However, the strong and consistent performance of WeightedEnsemble across both sets suggests that its automated ensembling and tuning mechanisms are highly effective in this context.
11.2 High risk prediction
In Subsection 11.1, we ranked the models by the area under the Receiver Operating Characteristic curve (AUC), which measures performance across all cases from low to high risk. Since the high-risk segment usually receives the most attention for economic reasons, model performance in this segment matters more for the application. The lift chart is well suited to this evaluation: a cumulative lift chart shows the improvement a model offers over random selection, expressed as a lift score.
# Core of the scikit-plot function cumulative_gain_curve (needed because SciPy release 1.12.0 broke compatibility with scikit-plot version 0.3.7).
# Source: https://github.com/reiinakano/scikit-plot/blob/26007fbf9f05e915bd0f6acb86850b01b00944cf/scikitplot/helpers.py
def cumulative_gain_curve1(y_true, y_score):
    """This binary classification function generates the points necessary to plot the Cumulative Gain"""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    # make y_true a boolean vector
    y_true = (y_true == 1)
    # sort cases from highest to lowest predicted score
    sorted_indices = np.argsort(y_score)[::-1]
    y_true = y_true[sorted_indices]
    # cumulative share of positives captured and cumulative share of cases targeted
    gains = np.cumsum(y_true)
    percentages = np.arange(start=1, stop=len(y_true) + 1)
    gains = gains / float(np.sum(y_true))
    percentages = percentages / float(len(y_true))
    # prepend the origin (0, 0)
    gains = np.insert(gains, 0, [0])
    percentages = np.insert(percentages, 0, [0])
    return percentages, gains
# Calculate cumulative gains for lift chart
def calculate_gains_and_percentages(model_predictions, y_true):
    """Calculate gains and percentages for a given model's predictions."""
    percentages, gains = cumulative_gain_curve1(y_true, model_predictions)
    gains_adjusted = gains[1:] / percentages[1:]  # Lift = gain / percentage; skip the first element (0 / 0)
    return percentages[1:], gains_adjusted
# Select models and prepare true and predicted values
y_true = np.array(np.ravel(y_test))
model_predictions = {
'CB1_quick': np.array(y_test_CB1[:, 1]),
'CB5_subsample': np.array(y_test_CB5[:, 1]),
'LGB_tuned': np.array(y_test_LGB[:, 1]),
'XGB_tuned': np.array(y_test_XGB[:, 1]),
'LR2_with_embeddings': np.array(y_test_LR2),
'CB7_with_embeddings': np.array(y_test_CB7[:, 1]),
'CB8_with_embeddings': np.array(y_test_CB8[:, 1]),
'WeightedEnsemble': np.array(y_test_WE[1]),
}
# Initialize lift chart
fig, ax = plt.subplots(figsize=(8, 8))
ax.set_title("Lift chart for high risk area (top decile)")
ax.set_xlabel('Fraction of high risk readmissions')
ax.set_ylabel('Lift')
plt.xlim(0, 0.15)
plt.ylim(0, 8.0)
# Calculate and plot lift score for each model
for label, predictions in model_predictions.items():
    percentages, gains_adjusted = calculate_gains_and_percentages(predictions, y_true)
    ax.plot(percentages, gains_adjusted, lw=2, label=label)
# Plot random choice baseline
ax.plot([0, 1], [1, 1], 'k--', lw=2, label='random choice')
ax.legend(loc='upper right')
plt.show()
The cumulative lift chart illustrates the factor (lift score) by which the model enhances the "hit rate". For instance, targeting the top 10% of cases with the highest predicted probability of readmission, where the lift score is around 3, could capture 30% of all actual readmissions.
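The arithmetic behind this reading is: lift = (share of all readmissions captured) / (share of cases targeted). The sketch below works through it in plain Python on synthetic toy data constructed so that the top decile contains 3 of 10 positives. `lift_at` is a hypothetical helper that mirrors what `calculate_gains_and_percentages` evaluates at a single cut-off; it is not part of the notebook.

```python
def lift_at(y_true, y_score, fraction):
    """Lift when targeting the top `fraction` of cases ranked by score."""
    n_top = max(1, int(round(fraction * len(y_true))))
    # Rank cases from highest to lowest predicted score.
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits_top = sum(y_true[i] for i in order[:n_top])
    gain = hits_top / sum(y_true)          # share of all positives captured
    return gain / (n_top / len(y_true))    # divided by the share of cases targeted


# Synthetic toy data: 100 cases with strictly decreasing scores and 10 positives,
# 3 of which sit among the 10 highest-scored cases.
n_cases = 100
positive_ranks = {0, 3, 7, 15, 22, 30, 41, 55, 68, 80}
score_toy = [(n_cases - i) / n_cases for i in range(n_cases)]
y_toy = [1 if i in positive_ranks else 0 for i in range(n_cases)]

lift_top_decile = lift_at(y_toy, score_toy, 0.10)
print(f"Lift in the top decile: {lift_top_decile:.2f}")
```

Here the top 10% of cases capture 30% of all positives, which corresponds to a lift of about 3, matching the example in the text.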
In our case, two models stand out in the top decile: the initial "quick & easy" baseline model and the CatBoost model with embeddings. The baseline model, however, performs better across most of the high-risk region, so we use it to define the top-risk segment. In the following analysis, we focus on the top decile and generate a list of the highest-risk readmission cases. Since the true outcomes are known for our test set, we compare them with the predicted probabilities.
# Merge predicted and true values into a DataFrame
df = pd.DataFrame({
    'PredProbability': y_test_CB1[:, 1],
    'TrueValue': y_test
})
# Sort by Probability in descending order
df_sorted = df.sort_values(by='PredProbability', ascending=False)
df_sorted.head()
| | PredProbability | TrueValue |
|---|---|---|
| 34138 | 0.888056 | 1 |
| 33582 | 0.885106 | 1 |
| 45194 | 0.857854 | 1 |
| 33042 | 0.853814 | 1 |
| 32855 | 0.846220 | 0 |
# Creating a 'Percentile' column based on quantile-based discretization into 100 bins
df_sorted['Percentile'] = pd.qcut(df_sorted['PredProbability'], 100, labels=range(1, 101))
# Calculate means of predicted and true values for the top 10 percentiles
percentile_means = df_sorted.groupby('Percentile', observed=True).mean().tail(10)
# Printing the results for the top 10 percentiles
print(percentile_means[::-1]) # This reverses the order to start with the highest probabilities
| Percentile | PredProbability | TrueValue |
|---|---|---|
| 100 | 0.632874 | 0.569444 |
| 99 | 0.466461 | 0.527778 |
| 98 | 0.405462 | 0.338028 |
| 97 | 0.368264 | 0.361111 |
| 96 | 0.339783 | 0.375000 |
| 95 | 0.314150 | 0.253521 |
| 94 | 0.290905 | 0.347222 |
| 93 | 0.271439 | 0.267606 |
| 92 | 0.256960 | 0.305556 |
| 91 | 0.245194 | 0.277778 |
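Although the mean predicted probabilities deviate from the observed readmission rates by up to roughly 0.07 in individual percentiles (for example, 0.63 predicted versus 0.57 observed in the top percentile), the percentile ranking itself is effective: the observed readmission rate falls broadly in line with the predicted ranking. This ranking can therefore guide decision-making, particularly given the varying and often substantial costs associated with inpatient care for individuals with diabetes mellitus.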
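To put a number on the deviation between predicted and observed values, the following sketch computes the gap per percentile from the figures printed above. It is a rough, dependency-free illustration and not a full calibration analysis; the variable names are hypothetical.

```python
# (percentile, mean predicted probability, observed readmission rate),
# copied from the printed output above.
top_percentiles = [
    (100, 0.632874, 0.569444),
    (99, 0.466461, 0.527778),
    (98, 0.405462, 0.338028),
    (97, 0.368264, 0.361111),
    (96, 0.339783, 0.375000),
    (95, 0.314150, 0.253521),
    (94, 0.290905, 0.347222),
    (93, 0.271439, 0.267606),
    (92, 0.256960, 0.305556),
    (91, 0.245194, 0.277778),
]

# Positive gap: the model over-predicts; negative gap: it under-predicts.
gaps = [pred - obs for _, pred, obs in top_percentiles]
mean_abs_gap = sum(abs(g) for g in gaps) / len(gaps)
n_overpredicted = sum(g > 0 for g in gaps)

print(f"Mean absolute gap: {mean_abs_gap:.4f}")
print(f"Percentiles with over-prediction: {n_overpredicted} of {len(gaps)}")
```

Gaps with mixed signs would be consistent with sampling noise in narrow percentile bins rather than systematic over- or under-prediction; this is an interpretation, not a result established in the notebook.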